Method and system for describing and identifying concepts in natural language text for information retrieval and processing

ABSTRACT

A method for information retrieval that matches occurrences of concepts in natural language text documents against descriptions of concepts in user queries. Said method, implemented in a computer system, includes a preferred version of the method that comprises (1) annotating natural language text in documents and other text-forms with linguistic information and Concepts and Concept Rules expressed in a Concept Specification Language (CSL) for a particular domain, (2) pruning and optimizing synonyms for a particular domain, (3) defining and learning said CSL Concepts and Concept Rules, (4) checking user-defined descriptions of Concepts represented in CSL (including user queries), and (5) retrieval by matching said user-defined descriptions (and queries) against said annotated text. CSL is a language for expressing linguistically-based patterns. Said patterns can represent the linguistic manifestations of concepts in text. Said concepts may derive from the sublanguages used by experts to analyze specialized domains including, but not limited to, insurance claims, police incident reports, medical reports, and aviation incident reports.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a National Stage Application of International Application No. PCT/CA01/01398, filed Sep. 28, 2001, which claims the benefit under 35 USC 119(e) of U.S. Provisional Patent Application No. 60/236,342, filed Sep. 29, 2000, where this provisional application is incorporated by reference in its entirety.

TECHNICAL FIELD

The present invention relates most directly to the field of information retrieval, but also to the fields of query languages and query specification, text processing and linguistic annotation, and machine learning/knowledge acquisition.

BACKGROUND OF THE INVENTION

1. Information Retrieval

Information retrieval (IR) refers to “the retrieval of documents . . . from large but specialized bodies of material, through the specification of document content rather than via keys like author names” (Sparck Jones, K., “Information Retrieval,” In S. C. Shapiro (Ed.) Encyclopedia of Artificial Intelligence, 2nd Edition. John Wiley & Sons, New York, N.Y., pp. 1605-1613 (1992)). In this invention the focus is on text retrieval (as opposed to audio or video retrieval). The problem of text retrieval has two parts: query specification, and retrieval of text documents.

Most IR systems use statistical techniques, depending on probabilistic models of query term distribution. For example, the Vector-Space approach is based on word occurrences (Salton, G., The Smart Retrieval System—Experiments in Automatic Document Processing, Prentice Hall, Englewood Cliffs, N.J. (1971)). Documents and queries are represented in a multi-dimensional space where each dimension corresponds to a word in the document collection. Relevant documents are those whose vectors are closest to the query vector.

In contrast to the Vector-Space focus on word occurrence distribution, approaches like Latent Semantic Indexing (Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and R. Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, 41, pp. 391-407 (1990)) are concerned with word co-occurrences. In this approach, the fact that two or more terms occur in the same document more often than chance is exploited to reduce the dimensionality of document and query space. Recent variants of these two basic approaches are described in Robertson, S. E., and S. Walker, “Okapi/Keenbow at TREC-8,” In E. M. Vorhees and D. K. Harman (Eds.), Proceedings of the Eighth Text Retrieval Conference (TREC-8), Gaithersburg, Md.: Department of Commerce, National Institute of Standards and Technology, NIST Special Publication 500-246, pp. 151-162 (1999); and Kwok, K., and M. Chan, “Improving Two-Stage Ad Hoc Retrieval for Short Queries,” In Proceedings of SIGIR '98, pp. 250-256 (1998).

In contrast to these statistical approaches, there has been limited work that focuses on a rule-based or linguistic approach to IR. When used, such approaches are often combined with a statistical approach (Strzalkowski, T., Perez-Carballo, J., Karigren, J., Hulth, A., Tapanainen, P., and T. Lahtinen, “Natural Language Information Retrieval: TREC-8 report,” In Proceedings of the Eighth Text Retrieval Conference (1999)).

The present invention differs from the current state of the art in two key respects: it allows users to specify richer queries using a query specification language and it offers a richer annotation of documents that might be potentially retrieved. On top of these improvements is more sophisticated matching of the richer user queries against the more richly annotated documents. Unlike the statistical approaches, the present invention does not rely on the results of ranking to determine whether a document should be returned to the user. Instead, it matches documents against the concepts specified by the user. Ranking can be additionally used to prioritize documents, but it is not mandatory.

The present invention also differs from previous research because it not only judges whether a document is relevant to a given query, but also identifies the specific portions of text that match the concept expressed by the input query. Therefore the present invention is not only relevant to IR purposes, but also to other fields like summarization and text mining.

Many patents describe methods and systems for information retrieval. Two of the most comprehensive recent IR patents are by Liddy et al., U.S. Pat. Nos. 5,963,940 and 6,026,388, but these differ markedly from the present invention. In particular, they do not provide its rich forms of annotation.

2. Query Languages and Query Specification

Query languages are traditionally associated with databases, i.e., repositories of structured data. Attempts have been made to extend database techniques and query languages to semi-structured data, particularly with the advent of the Worldwide Web (see, e.g., Buneman, P., “Semistructured Data.” In Proceedings of the ACM Symposium on Principles of Database Systems, Tucson, Ariz. Invited tutorial (1997)). An attempt at creating a query language for the XML markup language can be found in Deutsch, A., Fernandez, M., Florescu, D., Levy, A., and D. Suciu, “XML-QL: A Query Language for XML.” Submission to the World Wide Web Consortium 19 Aug. 1998 (August 1998).

The limitations of the current approaches to querying unstructured data are pointed out in Lacroix, Z., Sahaguet, A., Chandrasekar, R., and B. Srinivas, “A Novel Approach to Querying the Web: Integrating Retrieval and Browsing.” In ER97 Workshop on Conceptual Modeling for Multimedia Information Seeking, Los Angeles, Calif., USA (November 1997). Most notably, that work suggests integrating query languages with some sort of text processing and IR techniques.

In a typical IR session, users specify their information needs by using key words, or by typing their requests using a natural language such as English (Manning, C., and H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Mass. (1999)). Query languages in IR have been mostly confined to simple approaches like the bag-of-words approach. The IR system then converts these query words into a format relevant for the particular IR engine.

The present invention differs from these approaches in that it uses a specification language for the user to specify a query. The specification language allows the user to specify lexical, syntactic or semantic relationships. Using the specification language, the user specifies concepts that are required in the document, rather than just key words or entire natural language sentences. Concepts specified in this way are more accurate than simple keywords, and more general and flexible than complete sentences.

3. Text Processing and Linguistic Annotation

Linguistic annotation schemas exist for various kinds of linguistic information: phonetics, morphology, syntax, semantics, and discourse. See Bird, S., and M. Lieberman, “A Formal Framework for Linguistic Annotation.” University of Pennsylvania, Dept. of Computer and Information Science, Technical Report MS-CIS-99-01 (1999) for a definition of ‘linguistic annotation’ and a review of the literature on linguistic annotation.

U.S. Pat. No. 5,331,556, Black et al., entitled “Method for natural language data processing using morphological and part-of-speech information,” discloses in the abstract, “the method includes executing linguistic analysis upon a text corpus file to derive morphological, part-of-speech information as well as lexical variants . . . to construct an enhanced text corpus file. A query text file is linguistically analyzed to construct a plurality of trigger token morphemes . . . used to construct a search mask stream . . . A match between the search mask stream and the enhanced corpus file allows a user to retrieve selected portions of the enhanced text corpus.” However, U.S. Pat. No. 5,331,556 describes only a relatively general form of annotation, unlike the rich forms of annotation described in the present invention.

4. Machine Learning/Knowledge Acquisition

Machine learning (ML) refers to the automated acquisition of knowledge, especially domain-specific knowledge (cf. Schlimmer, J. C., and P. Langley, “Learning, Machine,” In S. C. Shapiro (Ed.) Encyclopedia of Artificial Intelligence, 2nd Edition. John Wiley & Sons, New York, N.Y., pp. 785-805 (1992), p. 785). In the context of the present invention, ML concerns learning Concepts.

The system most closely related to the present task is Riloff's (1993) AutoSlog, a knowledge acquisition tool that uses a training corpus to generate proposed extraction patterns for the CIRCUS extraction system. See Riloff, E., “Automatically Constructing a Dictionary for Information Extraction Tasks,” In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), pp. 811-816 (1993). A user either verifies or rejects each proposed pattern.

J.-T. Kim and D. Moldovan's (1995) PALKA system is a ML system that learns extraction patterns from example texts. See Kim, J.-T., and D. I. Moldovan, “Acquisition of Linguistic Patterns for Knowledge-Based Information Extraction,” IEEE Transactions on Knowledge and Data Engineering, 7 (5), pp. 713-724 (October 1995). The patterns are built using a fixed set of linguistic rules and relationships. Kim and Moldovan do not suggest how to learn syntactic relationships that can be used within extraction patterns learned from example texts.

In Transformation-Based Error-Driven Learning (See Brill, E., “A Corpus-Based Approach to Language Learning,” Ph.D. Dissertation, Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pa. (1993)), the algorithm works by beginning in a naive state about the knowledge to be learned. For instance, in tagging, the initial state can be created by assigning each word its most likely tag, estimated by examining a tagged corpus, without regard to context. Then the results of tagging in the current state of knowledge are repeatedly compared to a manually tagged training corpus and a set of ordered transformations is learnt, which can be applied to reduce tagging errors. The learnt transformations are drawn from a pre-defined list of allowable transformation templates. The approach has been applied to a number of other NLP tasks, most notably parsing (See Brill, E., “Transformation-Based Error-Driven Parsing,” In Proceedings of the Third International Workshop on Parsing Technologies, Tilburg, The Netherlands (1993)).

The Memory-Based Learning approach is “a classification based, supervised learning approach: a memory-based learning algorithm constructs a classifier for a task by storing a set of examples. Each example associates a feature vector (the problem description) with one of a finite number of classes (the solution). Given a new feature vector, the classifier extrapolates its class from those of the most similar feature vectors in memory” (See Daelemans, W., S. Buchholz, and J. Veenstra, “Memory-Based Shallow Parsing,” In Proceedings of the Computational Natural Language Learning (CoNLL-99) Workshop, Bergen, Norway, 12 Jun. 1999 (1999)).

Explanation-Based Learning is “a technique to formulate general concepts on the basis of a specific training example” (van Harmelen, F., and A. Bundy, “Explanation-Based Generalization=Partial Evaluation (Research Note),” Artificial Intelligence, 36, pp. 401-412 (1988)). A single training example is analyzed in terms of knowledge about the domain and the goal concept under study. The explanation of why the training example is an instance of the goal concept is then used as the basis for formulating the general concept definition by generalizing this explanation.

Huffman, U.S. Pat. Nos. 5,796,926 and 5,841,895, describes methods for automatic learning of syntactic/grammatical patterns for an information extraction system. The present invention also describes methods for automatically learning linguistic information (including syntactic/grammatical information), but not in ways described by Huffman.

BRIEF DESCRIPTION OF THE DRAWINGS

In drawings which illustrate a preferred embodiment of the invention:

FIG. 1 is a hardware block diagram showing an apparatus according to the invention;

FIG. 2 is a block diagram of the information retriever module shown in FIG. 1;

FIG. 3 is a block diagram of the text document annotator module shown in FIG. 2;

FIG. 4 is a block diagram of the annotator module shown in FIG. 3;

FIG. 5 is a block diagram of the linguistic annotator module shown in FIG. 4;

FIG. 6 is a block diagram of the preprocessor module shown in FIG. 5;

FIG. 7 is a block diagram of the tagger module shown in FIG. 5;

FIG. 8 is a block diagram of the parser module shown in FIG. 5;

FIG. 9 is a block diagram of the conceptual annotator module shown in FIG. 4;

FIG. 10 is a block diagram of the Concept identifier module shown in FIG. 9;

FIG. 11 is a simplified Text Tree for the cat chased the dog;

FIG. 12 is a simplified Concept Tree for the Concept Animal containing the Concept Rules “the¹ Precedes¹ dog” and “the² Precedes² cat”;

FIG. 13 is a block diagram of the finite state automata for the single-term Pattern A;

FIG. 14 is a block diagram of the finite state automata for A Precedes B;

FIG. 15 is a simplified Concept Tree for the Concept “VX Dominates dog”;

FIG. 16 is a block diagram of the finite state automata for “VX Dominates dog”;

FIG. 17 is a block diagram of the finite state automata for A OR B;

FIG. 18 is a block diagram of the finite state automata for Concept Animal containing the Concept Rules “the¹ Precedes¹ dog” and “the² Precedes² cat”;

FIG. 19 is a block diagram of the index-based matcher module shown in FIG. 10;

FIG. 20 is a block diagram of the synonym processor module shown in FIG. 2 comprising a synonym pruner and synonym optimizer;

FIG. 21 is a block diagram of the synonym processor module shown in FIG. 2 comprising a synonym pruner;

FIG. 22 is a block diagram of the synonym processor module shown in FIG. 2 comprising a synonym optimizer;

FIG. 23 is a block diagram of the synonym pruner module shown in FIG. 20 and FIG. 21 comprising manual ranking, automatic ranking, and synonym filtering;

FIG. 24 is a block diagram of the synonym pruner module shown in FIG. 20 and FIG. 21 comprising manual ranking and synonym filtering;

FIG. 25 is a block diagram of the synonym pruner module shown in FIG. 20 and FIG. 21 comprising automatic ranking and synonym filtering;

FIG. 25a is a block diagram of the synonym pruner module shown in FIG. 20 and FIG. 21 comprising automatic ranking, human evaluation, and synonym filtering;

FIG. 26 is a block diagram of the synonym optimizer module shown in FIG. 20 and FIG. 22 comprising removal of irrelevant and redundant synonymy relations;

FIG. 27 is a block diagram of the synonym optimizer module shown in FIG. 20 and FIG. 22 comprising removal of irrelevant synonymy relations;

FIG. 28 is a block diagram of the synonym optimizer module shown in FIG. 20 and FIG. 22 comprising removal of redundant synonymy relations;

FIG. 29 is a block diagram of the CSL processor module shown in FIG. 2;

FIG. 30 is a block diagram of the CSL Concept and Concept Rule learner module shown in FIG. 29 with linguistically annotated documents as input;

FIG. 31 is a block diagram of the CSL Concept and Concept Rule learner module shown in FIG. 29 with text documents as input;

FIG. 32 is a block diagram of the CSL Rule creator module shown in FIG. 30 and FIG. 31;

FIG. 33 is a block diagram of the CSL query checker module shown in FIG. 29;

FIG. 34 is a block diagram of the CSL parser module shown in FIG. 2;

FIG. 35 is a block diagram of the text document retriever module shown in FIG. 2.

DESCRIPTION

The present invention is described in three sections. Two versions of a method for information retrieval (IR) are described in section 1. Two versions of a system for IR are described in section 2. One system uses the first method of section 1; the second system uses the second method. The preferred embodiment of the present invention is the second system. Finally, a concept specification language called CSL (short for Concept Specification Language) is described in section 3.

The term “document” herein includes any body of data including text, collection of data including text, storage medium containing textual data, or other text-forms.

1. Method

Two versions of a method for IR are described. The first method uses concept specification languages in general and—though not necessarily—text markup languages in general. The second method uses CSL and—though not necessarily—TML (short for Text Markup Language), a type of text markup language. Both methods can be performed on a computer system or other systems or by other techniques or by other apparatus.

1.1. Method Using Concept Specification Languages and (Optionally) Text Markup Languages

The first IR method uses concept specification languages in general and—though not necessarily—text markup languages in general. That is to say, the first method necessarily uses concept specification languages in general, but does not require the use of a text markup language. The method matches text in documents and other text-forms against user-defined descriptions of concepts, and comprises up to eleven steps, which are now described.

Step (1) is the identification of linguistic entities in the text of documents and other text-forms. The linguistic entities identified in step (1) include, but are not limited to, morphological, syntactic, and semantic entities. The identification of linguistic entities in step (1) includes, but is not limited to, identifying words and phrases, and establishing dependencies between words and phrases. The identification of linguistic entities is accomplished by methods including, but not limited to, one or more of the following: preprocessing, tagging, and parsing.

Step (2) is the annotation of those identified linguistic entities from step (1) in, but not limited to, a text markup language to produce linguistically annotated documents and other text-forms. The process of annotating the identified linguistic entities from step (1) is known as linguistic annotation.

Step (3), which is optional, is the storage of these linguistically annotated documents and other text-forms.

Step (4) is the identification of concepts using linguistic information, where those concepts are represented in a concept specification language and the concepts-to-be-identified occur in one of the following forms:

-   text of documents and other text-forms in which linguistic entities have been identified as per step (1); or
-   the linguistically annotated documents and other text-forms of step (2); or
-   the stored linguistically annotated documents and other text-forms of step (3).

A concept specification language allows representations (e.g., rules) to be defined for concepts in terms of a linguistics-based pattern or set of patterns. Each pattern (or phrasal template) consists of words, phrases, other concepts, and relationships between words, phrases, and concepts. For example, the concept HighWorkload is linguistically expressed by the phrase high workload. In a concept specification language, patterns can be written that look for the occurrence of high and workload in particular syntactic relations (e.g., workload as the subject of be high; or high and workload as elements of the same nominal phrase, e.g., a high but not unmanageable workload). Expressions can also be written that seek not just the words high and workload, but also their synonyms.

All methods for identifying concepts work by matching linguistics-based patterns in a concept representation language against linguistically annotated texts. A linguistics-based pattern from a concept representation language is a partial representation of linguistic structure. Each time a linguistics-based pattern matches a linguistic structure in a linguistically annotated text, the portion of text covered by that linguistic structure is considered an instance of the concept.

Methods for identifying concepts can be divided into non-index-based methods for identifying concepts and index-based methods. Non-index-based methods for identifying concepts include, but are not limited to,

-   compiling the concept specification language into finite state automata (FSAs) and matching those FSAs against linguistically annotated documents,
-   recursive descent matching, and
-   bottom-up matching.

Recursive descent matching consists of traversing a concept specification expression and recursively matching its constituents against linguistic structures in annotated text. Bottom-up matching consists of the bottom-up generation of spans for words and constituents from linguistic structures in annotated text, and matching those spans against expressions in the concept specification language. (A span is one or more words or constituents that follow each other plus, optionally, structural information about those words and constituents.)

Index-based methods for identifying concepts employ an inverted index. An inverted index contains words, constituents, and (if available) tags for linguistic information present in linguistically annotated text. The index also contains spans for those words, constituents, and tags from linguistically annotated text.
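To illustrate such an index, the following minimal sketch (in Python) records a span for each word, constituent, and tag of an annotated text. The Span class, build_inverted_index function, and the depth values shown are illustrative assumptions for this sketch, not part of the specification.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Span:
        start: int        # position of the first word covered
        end: int          # position of the last word covered
        depth: int        # depth in the annotated structure

    def build_inverted_index(annotated_items):
        # annotated_items: (token_or_tag, Span) pairs drawn from a linguistically
        # annotated document, e.g. ("cat", Span(1, 1, 3)) for a word or
        # ("NX", Span(0, 1, 2)) for a noun-phrase constituent.
        index = defaultdict(list)
        for item, span in annotated_items:
            index[item].append(span)
        return index

    # Items for the annotated text "the cat chased the dog":
    items = [("the", Span(0, 0, 3)), ("cat", Span(1, 1, 3)), ("NX", Span(0, 1, 2)),
             ("chased", Span(2, 2, 2)), ("the", Span(3, 3, 4)), ("dog", Span(4, 4, 4)),
             ("NX", Span(3, 4, 3)), ("VX", Span(2, 4, 2))]
    index = build_inverted_index(items)
    # index["the"] -> [Span(0, 0, 3), Span(3, 3, 4)]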

Index-based methods for identifying concepts include, but are not limited to,

-   simple index-based matching, and
-   candidate checking index-based matching.

In simple index-based matching, iterators are attached to all the items in the expression in the concept specification language; index information about the state of each iterator is then used to generate and match spans that, if successful, cover the text selected for concept identification.

In candidate checking index-based matching, sets of candidate spans are identified, where a candidate span is a span that might, but does not necessarily, contain a concept to be identified (matched). Any span that is not covered by a candidate span from the sets of candidate spans is one that cannot contain a concept to be identified (matched). Each subexpression of an expression in the concept specification language is associated with a procedure, and each such procedure is used to generate candidate spans or to check whether a given span is a candidate span. These candidate spans can serve as input to the four other concept identification methods just described, plus any other concept identification method.

Compiling and matching finite state automata, recursive descent matching, bottom-up matching, and any other possible concept identification method could be made into index-based methods by employing an inverted index.
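The following is a minimal sketch of candidate checking index-based matching, under the simplifying assumption that the inverted index maps each word, constituent, or tag to a list of (start, end) intervals; the function names and the toy index are illustrative only.

    def candidate_spans(index, required_items):
        # Any interval of a required item is a candidate span; spans covered
        # by no candidate cannot contain the concept to be identified.
        candidates = []
        for item in required_items:
            candidates.extend(index.get(item, []))
        return candidates

    def check_precedes(index, a, b):
        # Check the configurational pattern "a Precedes b": succeed if some
        # candidate interval of a ends before some candidate interval of b starts.
        for a_start, a_end in index.get(a, []):
            for b_start, b_end in index.get(b, []):
                if a_end < b_start:
                    return (a_start, b_end)
        return None

    # Toy index for "the cat chased the dog":
    index = {"the": [(0, 0), (3, 3)], "cat": [(1, 1)], "dog": [(4, 4)],
             "NX": [(0, 1), (3, 4)], "VX": [(2, 4)]}
    print(candidate_spans(index, ["the", "dog"]))   # [(0, 0), (3, 3), (4, 4)]
    print(check_precedes(index, "the", "dog"))      # (0, 4)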

Step (5) is the annotation of the concepts identified in step (4), e.g., concepts like HighWorkload, to produce conceptually annotated documents and other text-forms. (Conceptually annotated documents are also sometimes referred to in this description as simply “annotated documents.”) The process of annotating the concepts identified in step (4) is known as conceptual annotation. As with step (2), conceptual annotation is in, but is not limited to, a text markup language.

Step (6), which is optional like step (3), is the storage of these conceptually annotated documents and other text-forms.

Steps (7), (8), and (9) are optional and independent of each other and the other steps. Step (7) is synonym pruning, which takes some synonym resource as input and establishes a set of domain-specific synonyms of natural language words and phrases for a specific knowledge domain. The pruning method comprises either a manual pruning step or an automatic pruning step or a combination of the two, followed by filtering. Manual pruning is applied to the synonymy relations more relevant in the specific domain. A synonymy relation is a relationship between two terms that are synonyms. Relevance is measured by a score based on the frequency of words in a domain-specific corpus.

The method for automatically pruning assigns a score to candidate synonymy relations, based on the frequency of the relevant words, and other semantically related terms, in a domain-specific corpus. During filtering, a filtering threshold is set and applied, and all candidates with a score beyond the threshold are eliminated.
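A minimal sketch of this kind of frequency-based scoring and threshold filtering follows; the scoring formula, function names, corpus, and threshold value are illustrative assumptions, not the claimed pruning method.

    from collections import Counter

    def score_synonymy_relation(term_a, term_b, domain_corpus_tokens):
        # Score a candidate synonymy relation by how frequent its two terms
        # are in a domain-specific corpus; rarer pairs get lower scores.
        counts = Counter(domain_corpus_tokens)
        return counts[term_a] + counts[term_b]

    def prune_synonyms(candidate_relations, domain_corpus_tokens, threshold):
        # Keep only candidate relations whose score meets the filtering
        # threshold; the remaining candidates are eliminated.
        kept = []
        for term_a, term_b in candidate_relations:
            if score_synonymy_relation(term_a, term_b, domain_corpus_tokens) >= threshold:
                kept.append((term_a, term_b))
        return kept

    corpus = "the pilot reported a high workload during the approach".split()
    candidates = [("workload", "burden"), ("approach", "tactic")]
    print(prune_synonyms(candidates, corpus, threshold=1))   # [("workload", "burden"), ("approach", "tactic")]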

Step (8) is synonym optimization in which a synonym resource (such as a machine-readable dictionary) is optimized by removing irrelevant sets of synonyms, or redundant sets of synonyms, or both. In optimization, such synonyms are removed in order to increase the accuracy and reduce the size of the synonym resource.

Step (9) is defining and learning the concept representations of the concept specification language, where the concept representations to be defined and learned include, but are not limited to, hierarchies, rules, operators, patterns, and macros.

Concept representations can be either defined by an operator or acquired (that is, learned) from a corpus. The learning of concept representations from corpora includes, but is not limited to, highlighting instances of concepts in the unprocessed text (or linguistically annotated text) of documents and other text-forms, then creating new concept representations in the concept specification language from those highlighted instances of concepts, then adding and, if necessary, integrating those concept representations in the concept specification language with pre-existing concept representations from the language.

The method of creating new concept representations in step (9) includes, but is not limited to, the following (a sketch follows the list below):

-   using the concept identification methods of step (4) to match together concept specification language vocabulary specifications and highlighted linguistically annotated documents and other text-forms, then
-   defining linguistic variants; then
-   adding synonyms from a set of synonyms, possibly supplied by synonym pruning and optimization, and then
-   adding part of speech information as appropriate.
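The sketch below suggests, purely for illustration, how a highlighted instance might be turned into a new rule-like pattern; the pattern syntax and the rule_from_highlight helper are hypothetical and only loosely modeled on the CSL described in section 1.2.

    def rule_from_highlight(highlighted_tokens, synonym_table=None):
        # highlighted_tokens: (word, part_of_speech) pairs for a highlighted
        # concept instance, e.g. [("high", "JJ"), ("workload", "NN")].
        # Returns a CSL-like pattern string joining the terms with Precedes,
        # attaching part-of-speech tags and any supplied synonyms.
        synonym_table = synonym_table or {}
        terms = []
        for word, pos in highlighted_tokens:
            alternatives = [word] + synonym_table.get(word, [])
            terms.append("(" + " OR ".join(f"{w}[{pos}]" for w in alternatives) + ")")
        return " Precedes ".join(terms)

    print(rule_from_highlight([("high", "JJ"), ("workload", "NN")], {"high": ["heavy"]}))
    # (high[JJ] OR heavy[JJ]) Precedes (workload[NN])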

Step (10) is checking user-defined descriptions of concepts represented in the concept specification language. A common form of checking is checking user queries represented in the concept representation language, since user queries are a type of user-defined description of concepts. When checking user queries, those queries are analyzed. Depending on the query content and the current state of knowledge about queries available courtesy of any repositories of pre-stored queries, a decision is made as to the appropriate way to process the incoming user query.

If all known queries are described in the concept representation language, then a proposed query is itself described in the representation language and is subsequently used by the retrieval method of step (11).

If the queries to be described in the concept representation language are not all known in advance, then a proposed query is described in the representation language and matched against the repository of pre-stored queries. If a match is found, the query is subsequently used by the method of retrieval. If a match is not found, the proposed query is subsequently sent for concept identification as per step (4).

Step (11) is the retrieval of text documents and other text-forms. Retrieval is based on matching user-defined descriptions of concepts (e.g., user queries) against conceptually annotated documents and other text-forms. The format of the output of retrieval depends on the application task for which the system has been configured (e.g., document ranking, document categorization, etc.).

1.2. Method Using CSL and (Optionally) TML

The second IR method uses CSL and—though not necessarily—TML, a type of text markup language. That is to say, the method necessarily uses CSL, but does not require the use of TML.

CSL is a language for expressing linguistically-based patterns. It is comprised of tag hierarchies, Concepts, Concept Rules, Patterns, Operators, and macros. One type of CSL Pattern is a “single-term Pattern,” which may refer to the name of a word, and optionally, its part of speech tag (a simple lexical tag or phrasal tag, if a tagger is used), and also optionally, synonyms of the word. Another type of CSL Pattern is a “configurational Pattern.” Configurational Patterns may have the form A Operator B, where the operator can be, among others, a Boolean (such as OR) and can express relationships such as dominance and precedence. Section 3 gives a definition of CSL.

TML presently has the syntax of XML (an abbreviation of eXtensible Markup Language), though TML isn't required to have the same syntax. A text annotated with TML could be used for a variety of natural language processing tasks such as information retrieval, text categorization, or even text mining. TML can easily be maintained and modified by non-experts. More lengthy descriptions of TML are given later.

The second method consists of the same basic steps, and relationships among the steps, as the first method. There are two differences between the two methods. The first difference is that wherever a concept specification language is used in the first method, CSL is used in the second. The second difference is that wherever a text markup language is referred to in the first method, TML is used in the second.

Hence, for example, in step (4) of the second method, the concept specification language is CSL and the step consists of identifying CSL Concepts and Concept Rules using linguistic information, not identifying the concepts of concept specification languages in general. A preferred embodiment of the second method is given in section 2.3.

2. System

Two versions of an IR system, using a common computer architecture, are described. One system employs the method described in section 1.1; hence it uses concept specification languages in general and—though not necessarily—text markup languages in general. The other system employs the method described in section 1.2; hence it uses CSL and—though not necessarily—TML. The preferred embodiment of the present invention is the second system. First, however, the common computer architecture is described.

2.1. Computer Architecture

FIG. 1 is a simplified block diagram of a computer system embodying the information retrieval system of the present invention. The invention is typically implemented in a client-server configuration including a server 105 and numerous clients connected over a network or other communications connection 110. The detail of one client 115 is shown; other clients 120 are also depicted. The term “server” is used in the context of the invention, where the server receives queries from (typically remote) clients, does substantially all the processing necessary to formulate responses to the queries, and provides these responses to the clients. However, the server 105 may itself act in the capacity of a client when it accesses remote databases located on a database server. Furthermore, while a client-server configuration is one option, the invention may be implemented as a standalone facility, in which case client 120 would be absent from the figure.

The server 105 comprises a communications interface 125a to one or more clients over a network or other communications connection 110, one or more central processing units (CPUs) 130a, one or more input devices 135a, one or more program and data storage areas 140a comprising a module and one or more submodules 145a for an information retriever 150 or processes for other purposes, and one or more output devices 155a.

The one or more clients comprise a communications interface 125b to a server over a network or other communications connection 110, one or more central processing units (CPUs) 130b, one or more input devices 135b, one or more program and data storage areas 140b comprising one or more submodules 145b for an information retriever 150 or processes for other purposes, and one or more output devices 155b.

2.2. System Using Concept Specification Languages and (Optionally) Text Markup Languages

The first system uses the computer architecture described in section 2.1 and FIG. 1. It also uses the method described in section 1.1; hence it uses concept specification languages in general and—though not necessarily—text markup languages in general. A description of this system can be assembled from sections 1.1 and 2.1. Although not described in detail within this section, this system constitutes part of the present invention.

2.3. System Using CSL and (Optionally) TML

The second system uses the computer architecture described in section 2.1 and FIG. 1. This system employs the method described in section 1.2; hence it uses CSL and—though not necessarily—TML. The preferred embodiment of the present invention is the second system, which will now be described with reference to FIGS. 2 to 35. The system is written in the C programming language, but could be embodied in any programming language. The system is an information retriever and is described in section 2.3.1.

2.3.1. Information Retriever

FIG. 2 is a simplified block diagram of the information retriever 206. The information retriever takes as input text in documents and other text-forms in the form of a signal from one or more input devices 203 to a user interface 212, and carries out predetermined information retrieval processes to produce a collection of text in documents and other text-forms, which are output from the user interface 212 in the form of a signal to one or more output devices 209.

The user interface 212 comprises windows for the loading of text documents, the processing of synonyms, the identification of concepts, the definition and learning of concepts, the formulation of user queries, and the viewing of search results.

The predetermined information retrieval processes, accessed by the user interface 212, comprise a text document annotator 215, synonym processor 218, CSL processor 221, CSL parser 227, and text document retriever 230. All these processes are described below. During these descriptions, all the boxes in FIG. 2 not mentioned in this section will be referred to; e.g., Section 2.3.10 on the synonym processor 218 refers to the synonym resource 242 and the processed synonym resource 242.

2.3.2. Text Document Annotator

FIG. 3 is a simplified block diagram of the text document annotator 310. The text document annotator 310, accessed by the user interface 305, comprises a document loader 365 from a document database 315, which passes text documents 370 to the annotator 375. The annotator 375 outputs annotated documents 355.

2.3.3. Annotator

FIG. 4 is a simplified block diagram of the annotator 460. The annotator 460 takes as input one or more text documents 455 and outputs corresponding documents, where the text is augmented with annotations representing linguistic and conceptual information 440. The annotator 460 is comprised of a linguistic annotator 465 which passes linguistically annotated documents 425 to a conceptual annotator 470.

The linguistically annotated documents 425 may be annotated in Text Markup Language (TML) 435 by passing them through a TML converter 430 (or converter for some other markup language), and may be stored 435.

TML presently has, but is not limited to, the syntax of XML (eXtensible Markup Language), but could have the syntax of any markup language. TML could have any syntax, but the present invention uses the syntax of XML for efficiency reasons. Because TML presently has the syntax of XML, the TML converter used by the system 430 is presently also an XML converter, though the TML converter 430 is not of necessity also an XML converter.

A TML object is a tree-structure, where each node is associated with a label. The top element (i.e., the root of the tree-structure) is a label (e.g., ‘TML’) identifying the structure as a TML object. In the following we refer to elements as nodes in the tree-structure, each of which is associated with a label. Each element specifies some kind of object, identified by an appropriate label. Within the top-level node are text elements (associated with a label like ‘TEXT’, for instance). Within each text element there are a number of sentence elements. Within the sentence elements are grammatical elements, i.e., syntactic constituents comprising a sentence. The grammatical elements are identified by grammatical tags, as assigned by a parser operated by the information retriever (206 in FIG. 2). Grammatical elements can have a role element associated with them, i.e., in addition to a label identifying their type of constituent, they can have a further label identifying their role in the sentence (subject, object, etc.). Each grammatical element (constituent) can in turn comprise further constituents. At the bottom of a constituent hierarchy there are text-word elements. Each text-word element has one word in it. Each word has a base element (i.e., a node representing its base form, or lemma), a role element (as previously defined), and a tag element (defining the word's part of speech).
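Because TML presently has XML syntax, such a tree can be sketched as an XML-style fragment and traversed with a standard XML parser. The element and attribute names below (TML, TEXT, S, NX, VX, W, base, tag, role) are illustrative assumptions that follow the structure just described, with base, tag, and role shown as attributes for brevity where the description above uses child elements; they are not the literal TML vocabulary.

    import xml.etree.ElementTree as ET

    # Hypothetical TML-like fragment for "the cat chased the dog".
    tml = """
    <TML>
      <TEXT>
        <S>
          <NX role="subject">
            <W base="the" tag="DT">the</W>
            <W base="cat" tag="NN">cat</W>
          </NX>
          <VX>
            <W base="chase" tag="VBD">chased</W>
            <NX role="object">
              <W base="the" tag="DT">the</W>
              <W base="dog" tag="NN">dog</W>
            </NX>
          </VX>
        </S>
      </TEXT>
    </TML>
    """

    root = ET.fromstring(tml)
    for word in root.iter("W"):
        print(word.text, word.get("base"), word.get("tag"))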

It is not necessary to impose a Document Type Definition (DTD) on TML, unlike other text markup languages that have the syntax of XML, because the DTD would have to duplicate the type definition defined in the grammar used by the system's parser, which would be redundant.

2.3.4. Linguistic Annotator

FIG. 5 is a simplified block diagram of the linguistic annotator 535. The linguistic annotator 535 takes as input one or more text documents 515, preprocessing rules (i.e., rewriting rules mapping input text expressions onto corresponding output expressions) 505, and information from a database of abbreviations 510. The linguistic annotator 535 outputs one or more linguistically annotated documents 520. The annotated information represents linguistic information about the input text documents 515.

The linguistic annotator 535 is comprised of a preprocessor 550, tagger 560, and parser 570. In the preferred embodiment, all three are used, as shown in FIG. 5; however, any configuration of these (or any other linguistic identification) as part of the information retriever shown in FIG. 2 is within the scope of the present invention.

2.3.5. Preprocessor

FIG. 6 is a simplified block diagram of the preprocessor 620. The preprocessor takes as input one or more text documents 615 and outputs one or more preprocessed (or tokenized) documents 655. The input text is a continuous stream of text. The output is the same text tokenized, i.e., broken into discrete units (words, punctuation, etc.), each augmented with information about its type (word, symbol, number, etc.). The preprocessor performs the following:

-   a) breaks a continuous stream of text into discrete units (words) 630 using preprocessing rules 605;
-   b) marks phrase boundaries 635;
-   c) identifies numbers, symbols, and other punctuation 640 using preprocessing rules 605;
-   d) expands abbreviations 645 using an abbreviations database 610; and
-   e) splits apart contractions 650.

The preprocessor 620 is responsible for breaking text into words 630 using preprocessing rules 605. The preprocessor 620 takes as input non-annotated text. It assigns a marker to each word found and outputs words on individual lines separated by a tab character. For instance, the following could be chosen as a set of valid markers:

_num    a number         “one”, “2nd”, . . .
_sym    a symbol         “$”, “%”, “+”, . . .
_punct  punctuation      “(”, “,”, “&”, . . .
_word   anything else

The preprocessor 620 is also responsible for marking phrase boundaries 635. Marking a phrase boundary includes, but is not limited to, identifying sentence final punctuation, such as a period.

The preprocessor 620 identifies numbers, symbols, and other punctuation 640 using preprocessing rules 605. When sentence final punctuation is found, the preprocessor outputs the punctuation followed by an end of phrase marker “<<EOP>>”. Sentence final punctuation is defined by the following: “!”, “.”, “. . . ”, “:”, “;”, and “?”.

The preprocessor 620 employs an abbreviation expander 645. The expander 645 replaces abbreviations. It works on character strings and returns a string in which the abbreviation is expanded.

The abbreviation expander 645 uses as a knowledge resource an abbreviation database 610, though it could use other resources. In the abbreviation database 610, each line contains an abbreviation and its expansion separated by a single tab character.

The preprocessor 620 also splits apart contractions 650. Contractions are split into separate words. Some examples of contractions split into complete words are shown in Table 1.

TABLE 1

Contraction    Word 1    Word 2
I'm            I         'm
I'd            I         'd
I'll           I         'll
I've           I         've
you're         you       're
don't          do        n't
deans'         deans     '
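A minimal sketch of a preprocessor along these lines is given below, with toy stand-ins for the preprocessing rules 605 and abbreviation database 610; the regular expressions, tables, and function names are illustrative assumptions rather than the system's actual rules.

    import re

    ABBREVIATIONS = {"approx": "approximately"}      # stand-in for database 610
    CONTRACTIONS = {"i'm": ["I", "'m"], "i've": ["I", "'ve"], "don't": ["do", "n't"]}
    EOP_PUNCT = {"!", ".", "...", ":", ";", "?"}

    def classify(token):
        # Assign one of the marker types listed above to a token.
        if re.fullmatch(r"\d+(st|nd|rd|th)?", token):
            return "_num"
        if re.fullmatch(r"[$%+]", token):
            return "_sym"
        if re.fullmatch(r"[^\w\s]+", token):
            return "_punct"
        return "_word"

    def preprocess(text):
        # Emit one line per token ("token<TAB>marker"), expanding abbreviations,
        # splitting contractions, and adding <<EOP>> after sentence-final punctuation.
        lines = []
        for tok in re.findall(r"\w+(?:'\w+)?|\.\.\.|[^\w\s]", text):
            tok = ABBREVIATIONS.get(tok.lower(), tok)
            for part in CONTRACTIONS.get(tok.lower(), [tok]):
                lines.append(part + "\t" + classify(part))
            if tok in EOP_PUNCT:
                lines.append("<<EOP>>")
        return lines

    print("\n".join(preprocess("I've seen the cat. It chased 2 dogs!")))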

2.3.6. Tagger

FIG. 7 is a simplified block diagram of the tagger 708. The tagger 708 takes as input one or more preprocessed documents 704, data from a machine-readable morphological dictionary 728 that has been processed by a morphological analyzer 724, tag information 732, and lexical probabilities 740 and contextual probabilities 744 from the tagger's datafile 736. The tagger 708 outputs one or more tagged documents 712, i.e., the input preprocessed documents 704 augmented with part of speech information for each token in those documents.

The tagger 708 is responsible for assigning an appropriate part of speech from a given tagset (e.g., the UPenn tagset) to each word given to it. It is also responsible for determining the base (uninflected) form of the word. The tagger 708 makes use of a tagger datafile 736 produced by a trainer module 748, and a file which specifies the tags used 732. The tagger is case-insensitive.

The preprocessed documents 704 are prepared for use 716 by the tagger. The tagger 708 receives as input a word and an optional alternate word. The alternate word is tried if the primary word is not present in the lexicon of the tagger datafile. The lexicon is case-insensitive.

The tagger 708 and trainer 748 work with tag sequences of a given length n. The tagger datafile 736 contains lexical probabilities 740 and contextual probabilities 744. The contextual probabilities 744 refer to tag sequences of the aforementioned length, and represent the conditional probability of a given tag given the preceding tags. Because of this, the tag for a given input word will not be determined until an additional n words have been input. Flushing or stopping the tagger will force all input words to be assigned a tag.

Starting the tagger causes it to automatically input n leading words called the start gutter. These special words can only have one tag, which is the start tag, a special tag not defined in the tagset being used. Flushing or stopping the tagger causes it to input n trailing words called the stop gutter. These special words can also only have one tag, which is the stop tag, also not defined in the tagset being used. The start and stop gutters do not appear in the output. The reason for the gutters is to improve tagger performance on words that appear at the start and end of phrases.

The tagger operates on the preprocessed documents as follows.

for each phrase
    reset word fifo queue
    reset possible tag sequences
    add start gutter (add three gutter start words)
    for each input word:
        add word to queue
        retrieve possible tags for word from the lexicon
        for each tag
            for each possible tag sequence
                tack on this tag to the sequence
                compute the probability of this tag sequence
        tag sequences are now A B C D, with D the tag being considered for the newest word
        find the first sequence with the highest probability
        group sequences by tags B and C
        for each group
            mark the sequence with the highest probability
        take word from queue
        take tag A from the overall sequence
        if word is not a gutter word
            use morphological analyzer
            output word, base form, and tag
        discard all unmarked tag sequences
        shift tag sequences left, dropping tag A
    add stop gutter (add three gutter stop words)

For each input word, the tagger outputs the primary word, its base form (as given by a morphological analyzer 724), and a tag from the tagset being used. This output is added to documents 760. Final output is a set of tagged documents 712.
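The following is a much-simplified sketch of this style of tagging: a greedy stand-in for the windowed tag-sequence search above, using toy lexical and contextual probability tables in place of the tagger datafile 736. All names and numbers are invented for illustration.

    START = "<START>"                      # start-gutter tag providing left context

    LEXICAL = {                            # toy P(tag | word), cf. lexical probabilities 740
        "the": {"DT": 1.0},
        "cat": {"NN": 0.9, "VB": 0.1},
        "chased": {"VBD": 1.0},
        "dog": {"NN": 1.0},
    }
    CONTEXTUAL = {                         # toy P(tag | previous tag), cf. contextual probabilities 744
        (START, "DT"): 0.6, ("DT", "NN"): 0.8, ("NN", "VBD"): 0.7, ("VBD", "DT"): 0.6,
    }

    def tag_phrase(words):
        # Greedily pick, for each word, the tag maximizing lexical probability
        # times contextual probability given the previous tag; the start gutter
        # supplies left context for the first word and is not output.
        prev, output = START, []
        for word in words:
            candidates = LEXICAL.get(word.lower(), {"NN": 1.0})
            best = max(candidates,
                       key=lambda t: candidates[t] * CONTEXTUAL.get((prev, t), 0.01))
            output.append((word, best))
            prev = best
        return output

    print(tag_phrase(["the", "cat", "chased", "the", "dog"]))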

2.3.7. Parser

FIG. 8 is a simplified block diagram of the parser 820. The parser, which may be a partial parser 840, takes as input one or more tagged documents 810, data from tag maps 850 and a grammar 860, and outputs one or more parsed documents 830. A linguistically annotated document is a tagged document augmented with syntactic information and other linguistic information for each sentence in the document.

Dependencies are established by parsing. The parser is responsible for assigning a parse structure to input text and assigning roles to the found constituents. Considerations of efficiency and robustness suggest that the best-suited parsers for this purpose, given the current state of the art in parsing, are those collectively known as partial parsers 840, an instance of which is the partial parser Cass2 produced by Steven Abney. See Abney, S., “Part-of-Speech Tagging and Partial Parsing,” In K. Church, S. Young, and G. Bloothooft (Eds.), Corpus-Based Methods in Language and Speech. Kluwer Academic Publishers, Dordrecht, The Netherlands (1996).

The partial parser 840 takes as input words, word base forms, and part of speech tag information. On receipt of an end-of-phrase marker the partial parser 840 will parse the input it has received so far.

The partial parser 840 uses as data files tag maps 850 and a grammar 860. The tag maps contain a mapping from the tagset being used by the tagger (708 in FIG. 7) to the internal tagset used by the grammar 860. The grammar 860 contains the descriptions of the rules to be used by the parser to partially parse the input.

2.3.8. Conceptual Annotator

FIG. 9 is a simplified block diagram of the conceptual annotator 935. The conceptual annotator 935 takes as input linguistically annotated documents 920. The linguistically annotated documents 920 may have been stored 930. The documents may have been stored in TML format 930. If so, the documents need to be converted 925 back into the internal representations output by the parser (820 in FIG. 8).

The conceptual annotator 935 also uses CSL Concepts and Rules for annotation 915 and may also use the parser grammar 950 and synonyms from the processed synonym resource 910. The conceptual annotator 935 outputs conceptually annotated documents 945. An annotated document is a linguistically annotated document augmented with labels identifying the occurrences of Concepts and Concept Rules within the document. The conceptual annotator 935 comprises a Concept identifier 940.

2.3.9. Concept Identifier

FIG. 10 is a simplified block diagram of the Concept identifier 1027. The Concept identifier 1027 outputs conceptually annotated documents 1021, which are linguistically annotated documents 1012 augmented with the location of Concepts and Concept Rules in the text of those documents.

Four different techniques are used for concept identification: finite state matching 1063, recursive descent matching 1066, bottom-up matching 1069, and index-based matching 1072. Which technique is used depends on (1) whether it is desirable to use linguistically annotated documents 1012 that are indexed or non-indexed 1036, (2) the importance of speed of matching 1039, (3) the availability of memory resources 1042, and (4) the expected amount of backtracking.

Input to the Concept identifier 1027 and to all four techniques is linguistically annotated documents 1012, the processed synonym resource 1006, and CSL Concepts and Rules for annotation 1009. Additionally, the grammar from the (partial) parser 1024 is used in finite state matching 1063 and an inverted index 1048 is used in index-based matching 1072.

The four techniques are described below by explaining how they match a single-term Pattern A, and the operators A Precedes B, A Dominates B, A OR B, and A AND NOT B.

If the data to be matched is non-indexed, then available techniques include, but are not limited to: finite state matching 1063, recursive descent matching 1066, and bottom-up matching 1069. Their use of linguistically annotated documents 1012, the processed synonym resource 1006, and CSL Concepts and Rules for annotation 1009 is described. A worked example is given for the text the cat chased the dog.

The speed, space requirements, and need for backtracking of the different techniques are discussed in their respective subsections.

The following common definitions are assumed throughout the descriptions of the Concept identifier 1027 and the four Concept identification techniques. A “word” is defined as an annotated word or other annotated token that is associated with the text of one or more linguistically annotated documents 1012. A “constituent” is a syntactic construct such as a noun phrase, which can be represented via a tag or other label. For example, the constituent “noun phrase” might be represented with a tag called NX. Constituents may be found in linguistically annotated documents 1012.

A “text” is a sequence of words ordered according to the original text of one or more linguistically annotated documents 1012. A “Text Tree” 1033 is an internal representation, a data structure that implicitly or explicitly contains words and constituents and linguistic relations between them as described in one or more linguistically annotated documents 1012. Text Trees 1033 are extracted 1030 from linguistically annotated documents 1012. FIG. 11 shows a simplified Text Tree 1033 for the text the cat chased the dog.

The term “position” refers to a position of a word or constituent in a text. Using the Text Tree 1033 in FIG. 11 as an example, the integer 0 represents the position of the first word the in the text, 1 represents the position of the word cat, and so forth.

The term “interval” refers to a consecutive sequence of words in a text. An interval can be represented in various ways, for instance, as two integers separated by a dash, where the first integer is the start position and the second integer is the end position. For example, in the Text Tree 1033 in FIG. 11, cat occupies 1-1, and the cat occupies 0-1.

The term “depth” refers to the depth of a word or operator in a representation of text such as a Text Tree 1033. In FIG. 11, for example, the cat and chased have a depth of 2, whereas the dog has depth 3.

A “span” is a word or constituent, or alternatively, a set of words and constituents that follow each other, plus (optionally) some structural information about the word(s) and constituent(s). Such structural information includes, but is not limited to, position, interval, or depth information. A span can be represented in various ways, for instance, by integers or by other means. Again using the Text Tree 1033 in FIG. 11 as an example, the span of the cat is the interval 0-1 and depth 2.
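For concreteness, the following is a minimal sketch of a Text Tree and of deriving interval and depth information for the cat chased the dog; the Node class and tree layout below are illustrative, and FIG. 11 shows the actual simplified Text Tree.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        label: str                               # constituent tag (e.g., NX) or word
        children: List["Node"] = field(default_factory=list)
        position: Optional[int] = None           # set only for leaf (word) nodes

        def interval(self):
            # Interval covered by this node, as (start, end) word positions.
            if self.position is not None:
                return (self.position, self.position)
            starts, ends = zip(*(c.interval() for c in self.children))
            return (min(starts), max(ends))

    def depth_of(tree, target, depth=1):
        # Depth of a subtree within the Text Tree (root at depth 1).
        if tree is target:
            return depth
        for child in tree.children:
            found = depth_of(child, target, depth + 1)
            if found:
                return found
        return None

    def word(w, pos):
        return Node(w, position=pos)

    # Simplified Text Tree for "the cat chased the dog" (cf. FIG. 11):
    the_cat = Node("NX", [word("the", 0), word("cat", 1)])
    text_tree = Node("CX", [the_cat,
                            Node("VX", [word("chased", 2),
                                        Node("NX", [word("the", 3), word("dog", 4)])])])

    print(the_cat.interval(), depth_of(text_tree, the_cat))   # (0, 1) 2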

The inverted index 1048, which is used in index-based matching 1072, contains words, constituents, and tags for structural information (e.g., NX) from the linguistically annotated documents 1012 and their spans.

A “Concept Tree” 1054 is a data structure that represents Concepts, Concept Rules and CSL Expressions and their sub-expressions and their relationships. Concept Trees 1054 are built 1060 from CSL Concepts and Rules for annotation 1009.

Building Concept Trees 1060 takes as input CSL Concepts and Rules for annotation 1009. Each Concept and Rule is represented as a Concept Tree 1054. Build Concept Trees 1060 checks through each tree representation for Concepts. It expands out each Concept into a series of disjunctions, represented in tree form. For example, build Concept Trees 1060 will take a Concept “Animal” that consists of two Rules

-   Concept: Animal
    -   Concept Rule1: “the Precedes cat”
    -   Concept Rule2: “the Precedes dog”

and translate the Concept into the Concept Tree 1054 shown in FIG. 12.

For the purposes of simpler explanation, Concept Rules are often treated as the CSL Expressions that are associated with them. Similarly, Concepts are treated as CSL Expressions that represent disjunctions of their Concept Rules.
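A minimal sketch of building such a Concept Tree for the Concept Animal follows; the Expr class and helper names are invented for illustration, and FIG. 12 shows the actual tree.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Expr:
        op: str                                   # "OR", "PRECEDES", or "TERM"
        children: List["Expr"] = field(default_factory=list)
        term: Optional[str] = None

    def precedes(a, b):
        return Expr("PRECEDES", [Expr("TERM", term=a), Expr("TERM", term=b)])

    def build_concept_tree(rules):
        # Expand a Concept into a disjunction (OR) of its Concept Rules.
        return rules[0] if len(rules) == 1 else Expr("OR", rules)

    animal = build_concept_tree([precedes("the", "cat"), precedes("the", "dog")])
    # animal.op == "OR", with two PRECEDES sub-expressions beneath it.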

2.3.9.1. Finite State Matcher

Finite state matching 1063 takes as input finite state automata (FSAs) 1057 and Text Trees 1033 and produces as output conceptually annotated documents 1021.

While finite state matching 1063 provides the fastest matching with no backtracking, there is a trade-off between processing time and storage space. Before finite state matching 1063 can be performed, Concept Trees 1054 must first be compiled into finite state automata (FSAs) 1057. Concept Tree compilation into FSAs 1051 uses considerable storage space.

CSL compilation into FSAs 1051 uses as input Concept Trees 1054 and the processed synonym resource 1006. It may also use the grammar of the (partial) parser 1024.

CSL compilation into FSAs 1051 is now explained for a single-term Pattern A, and the operators A Precedes B, A Dominates B, A OR B, and A AND NOT B. FSAs for these patterns and operators are shown in FIG. 13 through FIG. 18 below. The FSAs generated are non-deterministic. There is no mapping for the AND NOT operator given its current definition.

Each CSL Concept is made up of a number of disjunctive rules, each comprising a single CSL Expression. These are compiled separately and then added together and simplified using standard FSA techniques. This yields a single FSA representing the Concept.

FSAs consist of a number of states and transitions. There is one start state from where the FSA begins, indicated by a circle with an “S”. There may be multiple end states, indicated by a circle with an “E”, which are the states the automaton must reach in order to succeed.

Traversing transitions consumes input and changes the current state. The transition label “*” will match any input. The label “˜WORD” will match any word other than WORD. “˜CONSTITUENT” will match anything other than the constituent marker CONSTITUENT.

Compilation of a single-term Pattern A produces the FSA shown in FIG. 13. The pattern A is a terminal node in a Concept Tree 1054, where A can be either a word alone or a word and a part-of-speech tag. The FSA has transitions that match a word (or a word and a part-of-speech tag) only if it satisfies all the constraints imposed by the pattern A.

Compilation of A Precedes B produces the FSA shown in FIG. 14, but only when there is no dominance within a Concept Tree 1054. Each sub-term (A and B) is mapped recursively.

Compilation into FSAs of the Dominates operator is only possible with an actual Concept Tree 1054. When there is a dominance operation in the CSL Expression, all productions of the dominating constituent are taken from the partial parser grammar 1024. The CSL Expression is then matched against these productions. This gives possible values and placements for constituent markers as well as the expression being dominated. Multiple FSAs will be produced depending on the number of productions and possible CSL matches. The FSAs produced are portions of the partial parser grammar 1024, converted from regular expressions into FSAs, instantiated with the CSL that needs to be matched.

To demonstrate the compilation of the Dominates operator into FSAs, let us assume that the CSL Expression VX Dominates dog is one of the CSL Concepts and Rules for annotation 1009. Build Concept Trees 1060 builds the Concept Tree 1054 for this CSL Expression. That Concept Tree 1054 is shown in FIG. 15.

Let us also assume that the (partial) parser grammar 1024 contained the following rules:

-   VX = verb NX
-   NX = noun

Compilation of A Dominates B for these grammar rules 1024 and the Concept Tree 1054 shown in FIG. 15 produces the FSA shown in FIG. 16.

Compilation of A OR B produces the FSA shown in FIG. 17, but only when there is no dominance within a Concept Tree 1054. Each subterm (A and B) is mapped recursively.

Compilation of the AND NOT operator is not possible.

The resulting FSAs can be simplified using standard, widely known FSA techniques. The process for traversing a finite state machine (given some text as input) uses standard finite state techniques as implemented in Van Noord, G., FSA6 Reference Manual (2000).

Matching occurs by taking the automata for the Concepts to be considered and feeding them a flattened Text Tree 1033.

To see how finite state matching 1063 works, consider how the Text Tree 1033 of FIG. 11 is matched against the Concept Tree 1054 of FIG. 12. The Concept Tree 1054 is compiled into the FSA shown in FIG. 16.

The Text Tree 1033 is flattened to:

    #CX #NX the cat /NX #VX chased #NX the dog /NX /VX /CX

The next step is to feed the automaton the flattened Text Tree, one word at a time, following whichever transitions can consume the input. Table 2 shows how input is consumed by traversing the transitions to reach the end state E, signifying success.

TABLE 2
State   Transition   Input
S       —            #CX #NX the cat /NX #VX chased #NX the dog /NX /VX /CX
S       —            #NX the cat /NX #VX chased #NX the dog /NX /VX /CX
S       A            the cat /NX #VX chased #NX the dog /NX /VX /CX
1       C            cat /NX #VX chased #NX the dog /NX /VX /CX
E       —
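
The traversal itself can be pictured with the sketch below, in which a non-deterministic automaton is stored as a transition table and fed the flattened Text Tree one token at a time. The toy automaton and its state names are illustrative assumptions, not the FSA of FIG. 16.

    def label_matches(label, token):
        # "*" matches anything; "~X" matches anything but X; otherwise exact match.
        if label == "*":
            return True
        if label.startswith("~"):
            return token != label[1:]
        return token == label

    def nfa_accepts(transitions, start, end_states, tokens):
        """Feed tokens to a non-deterministic FSA; succeed when an end state is reached."""
        active = {start}
        for token in tokens:
            if active & end_states:
                return True
            active = {nxt
                      for state in active
                      for label, nxt in transitions.get(state, [])
                      if label_matches(label, token)}
        return bool(active & end_states)

    # Toy automaton that accepts any input containing "the" immediately followed by "cat".
    toy = {"S": [("*", "S"), ("the", "1")], "1": [("cat", "E")]}
    tokens = "#CX #NX the cat /NX #VX chased #NX the dog /NX /VX /CX".split()
    print(nfa_accepts(toy, "S", {"E"}, tokens))   # True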

2.3.9.2. Recursive Descent Matcher

The advantages of the recursive descent matcher over other matchers include relative simplicity and small space requirements. However, there is the possibility of a nontrivial amount of backtracking for certain Concepts.

The recursive descent matcher 1066 takes as input one or more Concept Trees 1054, one or more Text Trees 1033, and the processed synonym resource 1006.

Given a CSL Expression from the Concept Tree 1054, a Text Tree 1033, and a position in the text, the recursive descent matching algorithm 1066 can determine whether the CSL Expression matches the text at the given position. It can also determine the span of words that match the CSL Expression. Recursive descent matching 1066 is now explained for a single-term Pattern A, and the operators A Precedes B, A Dominates B, A OR B, and A AND NOT B.

A single-term Pattern A matches at the position if there is a word or a constituent at the position that satisfies all the constraints imposed by the pattern A.

A Precedes B matches at the position if A matches at the position and B matches at a position that follows A. The spans of A and B must not overlap, but they can be non-contiguous.

A Dominates B matches at the position if B matches at a position that is within the text spanned by the subtree of A.

A OR B matches at the position if A matches at the position or B matches at the position.

A AND NOT B matches at the position if A matches at the position and B is not found within the span of A.
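
A compact Python sketch of these matching rules is given below. It is illustrative only: the text is a flat token list, constituents are supplied as a separate span table rather than a real Text Tree 1033, and the Dominates case takes a phrasal tag rather than an arbitrary Pattern.

    def match_at(pattern, tokens, constituents, pos):
        """Return the (start, end) span matched at pos, or None.

        pattern is a nested tuple such as ("precedes", ("word", "the"), ("word", "cat"));
        constituents maps a phrasal tag such as "#VX" to a list of (start, end) spans."""
        op = pattern[0]
        if op == "word":
            return (pos, pos) if pos < len(tokens) and tokens[pos] == pattern[1] else None
        if op == "or":
            return (match_at(pattern[1], tokens, constituents, pos)
                    or match_at(pattern[2], tokens, constituents, pos))
        if op == "precedes":
            a = match_at(pattern[1], tokens, constituents, pos)
            if a is None:
                return None
            for later in range(a[1] + 1, len(tokens)):        # B must follow A
                b = match_at(pattern[2], tokens, constituents, later)
                if b is not None:
                    return (a[0], b[1])
            return None
        if op == "andnot":
            a = match_at(pattern[1], tokens, constituents, pos)
            if a is None:
                return None
            if any(match_at(pattern[2], tokens, constituents, p) for p in range(a[0], a[1] + 1)):
                return None                                   # B occurs within A's span
            return a
        if op == "dominates":                                 # pattern[1] is a phrasal tag here
            for (s, e) in constituents.get(pattern[1], []):
                if s <= pos <= e:
                    inner = match_at(pattern[2], tokens, constituents, pos)
                    if inner is not None:
                        return inner
            return None
        raise ValueError(op)

    tokens = ["the", "cat", "chased", "the", "dog"]
    constituents = {"#CX": [(0, 4)], "#NX": [(0, 1), (3, 4)], "#VX": [(2, 4)]}
    animal = ("or", ("precedes", ("word", "the"), ("word", "cat")),
                    ("precedes", ("word", "the"), ("word", "dog")))
    print([match_at(animal, tokens, constituents, p) for p in range(len(tokens))])
    # [(0, 1), None, None, (3, 4), None]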

For an example of how recursive descent matching 1066 works, consider how the Text Tree 1033 of FIG. 11 is matched against the Concept Tree 1054 of FIG. 12.

The recursive descent matcher 1066 attempts to match a given Concept at each position in the text.

At position 0 (the first instance of the word the), the matcher 1066 traverses the Concept Tree 1054 in a top-down fashion. It first encounters the OR operator. It then checks the first sub-expression of OR, which is the Precedes¹ operator. For this operator, the matcher checks its first sub-expression, which is the single-term pattern the¹, against the Text Tree 1033. Since the¹ succeeds, the matcher checks the second sub-expression (the single-term pattern cat) against the Text Tree 1033, temporarily incrementing the position in the text. It finds the word cat, so the Precedes¹ operator succeeds. Since the first sub-expression of the OR operator has succeeded, there is no need to check the second, and the Concept Tree 1054 (and hence the Concept as a whole) succeeds. Each successful node in the Concept Tree 1054 reports the span that it matched to the node above it, so the overall result is the span the cat.

The recursive descent matcher 1066 now increments the text position to position 1, which is the position of the word cat. The matcher goes through the Concept Tree 1054 until it hits the single-term pattern the¹, which does not match against the Text Tree 1033. The matcher 1066 then tries the other branch of the OR until it hits the single-term pattern the², which also does not match against the Text Tree 1033, so the match fails at position 1.

Similarly, at position 2 (the word chased) the word the is not found, so the match fails. At position 3 the first branch of the OR fails, but the second branch returns the span the dog. The matcher 1066 works in similar fashion through the remainder of the Text Tree 1033, but no more matches are found.

2.3.9.3. Bottom-Up Matcher

The advantage of the bottom-up matcher 1069 is its ability to produce all solutions in a reasonable amount of memory, without backtracking.

The bottom-up matcher 1069 takes as input one or more Concept Trees 1054, one or more Text Trees 1033, and the processed synonym resource 1006.

The matcher 1069 computes spans consumed by single-term patterns and spans consumed by operators from a Concept Tree 1054 in a bottom-up fashion.

For each single-term Pattern A, the algorithm computes the spans that match the pattern by consulting the Text Tree 1033 and selecting spans that satisfy all the constraints imposed by the pattern. Table 3 shows the mappings built between single-term Patterns in the Concept Tree 1054 of FIG. 12 and spans for the words of the Text Tree 1033 of FIG. 11.

TABLE 3
Single-term pattern in Concept Tree   Spans of words in Text Tree
the¹                                  interval 0-0, depth 2; interval 3-3, depth 3
cat                                   interval 1-1, depth 2
the²                                  interval 0-0, depth 2; interval 3-3, depth 3
dog                                   interval 4-4, depth 3

For every operator, the bottom-up matcher 1069 builds indices representing the spans of text consumed by that operator. Because the matcher 1069 works in a bottom-up fashion, each operator knows the spans of its arguments. Given spans for their arguments A and B, the spans for the different operators can be computed as follows.

A Precedes B. For every pair of spans from A and B such that the span from A precedes the span from B, output a span that covers both spans. Set the depth to be the minimum of the two.

A Dominates B. For every pair of spans from A and B such that the span from A overlaps the span from B at a lesser depth, output the span from B. For example, in the Text Tree 1033 of FIG. 11, #CX (interval 0-4, depth 0) dominates #VX (interval 2-4, depth 1).

A OR B. For A and B, output every span that is a span of A or of B.

A AND NOT B. Select all spans from A such that there is no span in B that would be overlapped by the span from A.
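
As a sketch, the four combination rules above can be written directly over (start, end, depth) triples of the kind shown in Tables 3 and 4; the function names and the overlap test are illustrative assumptions.

    def precedes(spans_a, spans_b):
        # A span from A must end before the span from B begins; the result
        # covers both spans and takes the minimum (shallowest) depth.
        return [(a0, b1, min(da, db))
                for (a0, a1, da) in spans_a
                for (b0, b1, db) in spans_b
                if a1 < b0]

    def dominates(spans_a, spans_b):
        # Keep each span of B that overlaps a span of A lying at a lesser depth.
        return [(b0, b1, db)
                for (a0, a1, da) in spans_a
                for (b0, b1, db) in spans_b
                if da < db and not (a1 < b0 or b1 < a0)]

    def disjunction(spans_a, spans_b):            # A OR B
        return spans_a + spans_b

    def and_not(spans_a, spans_b):                # A AND NOT B
        return [(a0, a1, da) for (a0, a1, da) in spans_a
                if not any(not (a1 < b0 or b1 < a0) for (b0, b1, db) in spans_b)]

    # The dominance example from the text: #CX (interval 0-4, depth 0)
    # dominates #VX (interval 2-4, depth 1).
    print(dominates([(0, 4, 0)], [(2, 4, 1)]))    # [(2, 4, 1)]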

The general bottom-up computation can be enhanced with a (possibly) top-down passing of constraints between CSL Expressions. The constraints can be used to limit the number of matches returned and thus to provide more efficient computation.

Table 4 shows mappings built between operators in the Concept Tree 1054 of FIG. 12 and spans for the operators in the Text Tree 1033 of FIG. 11.

TABLE 4
Operators in Concept Tree   Spans in Text Tree
Precedes¹                   interval 0-1, depth 2
Precedes²                   interval 3-4, depth 3; interval 0-4, depth 2; interval 0-4, depth 2
OR                          interval 0-1, depth 2; interval 3-4, depth 3

The bottom-up matcher 1069 matches the Concept Tree 1054 against the Text Tree 1033 in the following manner. The matcher 1069 combines the spans for the single-term Pattern the¹ with the spans for the single-term Pattern cat to obtain the spans for the Precedes¹ operator corresponding to the¹ Precedes¹ cat. The matcher 1069 considers possible pairs of spans where the first span comes from the spans of the¹ and the second span comes from the spans of cat. The possible pairs are <interval 0-0, depth 2; interval 1-1, depth 2> and <interval 3-3, depth 3; interval 1-1, depth 2>. Only the first pair of spans <interval 0-0, depth 2; interval 1-1, depth 2> satisfies the condition that the first span precedes the second span. The spans from the pair are combined together to produce a single span <interval 0-1, depth 2>, as recorded in Table 4. Similarly, the spans for the² are combined with the spans for dog to produce the spans for the Precedes² operator in the² Precedes² dog. The spans for the OR operator are computed as a union of the spans for the two Precedes operators. The results are again shown in Table 4.

2.3.9.4. Index-Based Matcher

FIG. 19 is a simplified block diagram of the index-based matcher 1965. The index-based matcher 1965 takes as input one or more Concept Trees 1940 and the processed synonym resource 1910. The index-based matcher 1965 also takes as input an inverted index 1935 of one or more Text Trees 1920.

An inverted index 1935 contains spans of each word and constituent (e.g., noun phrase) in a Text Tree 1920. Consider the text the cat chased the dog and its Text Tree 1920 shown in FIG. 11. Table 5 shows an inverted index for that Text Tree 1920.

TABLE 5
Words and constituents   Spans of words and constituents
#CX                      interval 0-4, depth 0
#NX                      interval 0-1, depth 1; interval 3-4, depth 2
#VX                      interval 2-4, depth 1
the                      interval 0-0, depth 2; interval 3-3, depth 3
cat                      interval 1-1, depth 2
chased                   interval 2-2, depth 2
dog                      interval 4-4, depth 3
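
A sketch of how such an inverted index could be built appears below. The nested-tuple encoding of the Text Tree is an illustrative assumption; the spans and depths it produces agree with Table 5.

    from collections import defaultdict

    def build_index(tree):
        """Walk a nested Text Tree and record a (start, end, depth) span
        for every word and for every constituent marker."""
        index = defaultdict(list)
        pos = 0

        def walk(node, depth):
            nonlocal pos
            if isinstance(node, str):             # a word occupies a single position
                index[node].append((pos, pos, depth))
                pos += 1
                return
            marker, children = node[0], node[1:]  # e.g. ("#NX", "the", "cat")
            start = pos
            for child in children:
                walk(child, depth + 1)
            index[marker].append((start, pos - 1, depth))

        walk(tree, 0)
        return dict(index)

    text_tree = ("#CX", ("#NX", "the", "cat"),
                        ("#VX", "chased", ("#NX", "the", "dog")))
    for key, spans in build_index(text_tree).items():
        print(key, spans)
    # e.g. "the" -> [(0, 0, 2), (3, 3, 3)], "#VX" -> [(2, 4, 1)], "#CX" -> [(0, 4, 0)]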

The index-based matcher 1965 is faster than matchers that use parse-tree-like representations of the linguistically annotated documents.

When Text Trees 1920 are indexed, there remains a choice of at least two index-based matching techniques, depending on the sparseness of Concepts 1970. The sparseness of Concepts 1970 affects the speed of matching. Candidate checking index-based matching 1980 seems to be faster for sparsely occurring Concepts, whereas simple index-based matching 1975 seems to be faster for densely occurring Concepts. (Note that it would be straightforward to build an index-based finite state matcher 1063 and an index-based recursive descent matcher 1066, though we have not elected to do so.)

2.3.9.4.1. Simple Index-Based Matcher

The simple index-based matcher 1975 traverses the Concept Tree 1940 in a recursive depth-first fashion, using backtracking to resolve the constraints of the various CSL operators, until all matches are produced for the items of the Concept Tree 1940 against the text in the inverted index 1935. Familiarity with the techniques of recursion and backtracking is a necessary prerequisite for understanding this algorithm.

Each node in the Concept Tree 1940 maintains a state that is used to determine whether or not it has been processed before in the match of the Concept Tree 1940 against the inverted index 1935, as well as relevant information about the progress of the match.

The state of nodes for single-term Patterns that are words includes the following information: a list of applicable synonyms of the word in question, and the current synonym being used for matching; and an iterator into the inverted index 1935 that can enumerate all instances of the word in the index, and which records the current word.

Nodes for single-term Patterns that are tags and constituents simply maintain iterators into the index, which record the particular instance of the tag or constituent node being used for the current match.

Nodes for the Precedence, Dominance, OR, and AND NOT operators all record which of their operands have been tried and have been successful in the current match.

During the course of a match, each node is tested and, if successful, returns a set of spans covering the match of its corresponding CSL sub-expression (i.e., all nodes below that node in the Concept Tree 1940).

To understand how simple index-based matching 1975 proceeds, consider again the text the cat chased the dog and its Text Tree 1920 shown in FIG. 11. The simple index-based matcher 1975 uses the inverted index 1935 for Text Tree 1920. This inverted index 1935 for Text Tree 1920 is shown in Table 5.

Table 6 shows the span information that the inverted index 1935 has already recorded for those words in the Text Tree 1920 shown in FIG. 11.

TABLE 6
Words    Spans of words
the      interval 0-0; interval 3-3
cat      interval 1-1
chased   interval 2-2
dog      interval 4-4

The match begins at the OR operator. The matcher notes that none of the operands have been tried before in the current match, so it picks the first one, a Precedes¹ operator. The matcher notes that the first operand of Precedes¹ has not been tried in this match, so it tries it. The node for the word the notes that it has never before been processed, so it initializes its iterator into the inverted index 1935, and returns a successfully matched span at 0-0. The matcher then processes the second operand of the Precedes¹ operator. This is the word cat, and it is successful at 1-1. The matcher then makes sure that the spans returned from the operands satisfy the constraints of a precedence relationship, namely that the first span should precede the second. In this case, the constraints are satisfied, so the Precedes¹ operator is successful with the span 0-1. The OR operator succeeds immediately, since only one of its operands need match, and returns 0-1 as well. Thus the entire match succeeds with the span 0-1.

Then the matcher backtracks. Upon backtracking, the matcher proceeds down the Concept Tree 1940 immediately to the last node it tried, the word cat. Upon backtracking, the word node attempts to increment its iterator. Since there are no more instances of the word cat in the index, the node fails. The Precedes¹ node then backtracks through its first operand, which is the word the. When the word node increments its iterator, it finds the second instance of the word the, at 3-3. Then the Precedes¹ node tries its second operand again, whereupon it fails. It then backtracks yet again through its first operand. Since there are no more instances of the word the, the word node fails, and the Precedes¹ node finally fails as well. Since the Precedes¹ node has failed, the OR operator tries its other operand. Matching proceeds as above until the second match (at 3-4) is found. Upon further backtracking all the iterators are exhausted and there are no more matches to be found.
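
Python generators give a compact way to picture this behaviour: each node yields its matches one at a time, and asking a generator for its next value effectively backtracks into the iterators below it. This is an illustrative sketch only, using the word spans of Table 6.

    def word_matches(index, word):
        # The node's iterator over all instances of the word in the inverted index.
        yield from index.get(word, [])

    def precedes_matches(index, word_a, word_b):
        # For each span of A, try every span of B; requesting the next value
        # from this generator re-enters the inner loops, i.e. it backtracks.
        for (a0, a1) in word_matches(index, word_a):
            for (b0, b1) in word_matches(index, word_b):
                if a1 < b0:                       # the precedence constraint
                    yield (a0, b1)

    index = {"the": [(0, 0), (3, 3)], "cat": [(1, 1)],
             "chased": [(2, 2)], "dog": [(4, 4)]}

    # Precedes1 (the precedes cat): the first answer is 0-1.  Backtracking then
    # tries the second "the" (3-3), which cannot precede "cat" (1-1), so this
    # iterator is exhausted and the OR node would move on to its other operand.
    print(list(precedes_matches(index, "the", "cat")))   # [(0, 1)]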

2.3.9.4.2. Candidate Checking Index-Based Matcher

The candidate checking index-based matcher 1980 has two parts. The first part identifies candidate spans 1960 that might, but do not necessarily, contain a Concept match. The second part produces a (possibly empty) set of matches within each candidate span.

Identifying a candidate span. The purpose of identifying candidate spans 1960 is to weed out pieces of text that cannot contain a Concept match.

A candidate span 1960 is a span that can overlap one or more matches. A candidate span can be as large as the whole text of a single document. The matcher 1980 splits up such a text into smaller texts of a manageable size. For example, the matcher 1980 would split the text

The cat chased the dog. John loves Mary. Peter loves Sandra.

into the following candidate spans 1960: The cat chased the dog (span 0-4), John loves Mary (span 5-7), and Peter loves Sandra (span 8-10). (Note that the numbers in these spans ignore the full stops at the end of each sentence.) The Iterators introduced in the example below would generate the span 0-4 as the candidate span 1960. However, spans 5-7 and 8-10 would not be generated as candidate spans 1960.

Consider, as another example, the Concept Tree 1940 in FIG. 12. Using candidate spans 1960 that correspond to documents, it might be useful to constrain the expensive matching for the Concept Animal to only those documents that contain the words the and cat or that contain the words the and dog. This saves the cost of matching on documents that can never satisfy the matching constraints because, for example, they do not contain the words cat or dog.

Each CSL sub-expression from the Concept Tree 1940 is associated with an Iterator. Iterators can be used to generate candidate spans or to check whether a given span is a candidate span. The pieces of text that are not covered by any of the generated candidate spans are guaranteed not to contain matches.

Iterators behave differently with different kinds of CSL Expressions. One possible behaviour for the Iterators is given below.

Single-term Pattern A. The inverted index 1935 is used to generate or check candidate spans 1960.

A Precedes B. The only spans returned are those generated as candidate spans 1960 by both Iterators associated with the arguments A and B. Similarly, the only spans checked as candidate spans 1960 are those spans that are also candidate spans 1960 of both Iterators associated with the arguments A and B.

A Dominates B. The behaviour of the Iterator for A Dominates B is the same as for A Precedes B.

A OR B. The only spans returned are those generated as candidate spans 1960 by at least one of the Iterators associated with arguments A and B. Similarly, the only spans checked as candidate spans 1960 are those spans that are also candidate spans 1960 of at least one of the Iterators associated with arguments A and B.

A AND NOT B. The only spans returned are those generated as candidate spans 1960 by the Iterator associated with argument A. Similarly, the only spans checked as candidate spans 1960 are those spans that are also candidate spans 1960 of the Iterator associated with argument A.
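
The span-checking side of these Iterators can be sketched as follows; the nested-tuple expressions, the overlap test, and the toy index are illustrative assumptions.

    def overlaps(span_a, span_b):
        (a0, a1), (b0, b1) = span_a, span_b
        return not (a1 < b0 or b1 < a0)

    def is_candidate(expr, span, index):
        """Check whether span could contain a match for the CSL expression expr."""
        op = expr[0]
        if op == "word":
            return any(overlaps(span, (s, e)) for (s, e) in index.get(expr[1], []))
        if op in ("precedes", "dominates"):       # both arguments must be present
            return (is_candidate(expr[1], span, index)
                    and is_candidate(expr[2], span, index))
        if op == "or":                            # at least one argument must be present
            return (is_candidate(expr[1], span, index)
                    or is_candidate(expr[2], span, index))
        if op == "andnot":                        # only the positive argument matters
            return is_candidate(expr[1], span, index)
        raise ValueError(op)

    index = {"the": [(0, 0), (3, 3)], "cat": [(1, 1)], "dog": [(4, 4)],
             "John": [(5, 5)], "loves": [(6, 6)], "Mary": [(7, 7)]}
    animal = ("or", ("precedes", ("word", "the"), ("word", "cat")),
                    ("precedes", ("word", "the"), ("word", "dog")))
    print(is_candidate(animal, (0, 4), index))    # True: could contain a match
    print(is_candidate(animal, (5, 7), index))    # False: cannot contain a match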

Consider, for example, how the span that covers the whole Text Tree 1920 for the cat chased the dog in FIG. 11 is checked. The OR Iterator succeeds because its first argument, the Precedes¹ Iterator, succeeds when it checks its span. The Precedes¹ Iterator succeeds because both of its argument Iterators succeed. Those argument Iterators are the the¹ Iterator and the cat Iterator. The the¹ Iterator succeeds because the input span (for the cat chased the dog) overlaps both of the spans for the¹ from the inverted index 1935 in Table 6. The span for the cat chased the dog is 0-4. The spans for the¹ in the inverted index 1935 are 0-0 and 3-3. The cat Iterator succeeds for the same reason.

Producing a Set of Matches. The candidate checking matcher 1980 can use any of the other Concept identification techniques described previously (FSA matching 1063, recursive descent matching 1066, bottom-up matching 1069, or simple index-based matching 1975), or any other possible identification technique, to produce the set of matches.

As an example, consider the use of the recursive descent matcher 1066 with the text

The cat chased the dog. John loves Mary. Peter loves Sandra.

This text produces the candidate spans 1960 The cat chased the dog (span 0-4), John loves Mary (span 5-7), and Peter loves Sandra (span 8-10).

In its preferred form, as described in the Recursive Descent Matcher section (2.3.9.2), the matcher 1066 would try to match the Concept Tree from FIG. 12 to the Text Tree 1920 corresponding to the whole text at every possible position between 0 and 10. However, using the candidate span 1960 information (knowing that matches cannot occur between positions 5-7 and 8-10), recursive descent matching 1066 can be constrained to check the Concept Tree from FIG. 12 only at positions 0, 1, 2, 3, and 4. It is not necessary for the recursive descent matcher 1066 to check positions 5 through 10. Hence, recursive descent matching 1066 can proceed faster.

2.3.10. Synonym Processor

FIG. 20, FIG. 21, and FIG. 22 are simplified block diagrams of a synonym processor 2010, 2110, and 2210 in various configurations. The synonym processor 2010, 2110, and 2210 takes as input a synonym resource 2020, 2120, and 2220 such as WordNet, a machine-readable dictionary, or some other linguistic resource. Such synonym resources 2020, 2120, and 2220 contain what we call “synonymy relations.” A synonymy relation is a binary relation between two synonym terms. One term is a word-sense; the second term is a word that has a meaning synonymous with the first term. Consider, for example, the word snow, which has several word senses when used as a noun, including a sense meaning “a form of precipitation” and another sense meaning “slang for cocaine.” The former sense of snow has a number of synonymous terms, including meanings of the words snowfall and snowflake. The latter sense of snow includes meanings of the words cocaine, cocain, coke, and C. Hence, snowfall and snowflake are in a synonymy relation with respect to the noun-sense of snow meaning “a form of precipitation.”

FIG. 20 shows the preferred embodiment in which the synonym processor 2030 comprises a synonym pruner 2050 and a synonym optimizer 2070. This is the configuration described in Turcato, D., Popowich, F., Toole, J., Fass, D., Nicholson, D., and G. Tisher, “Adapting a Synonym Database to Specific Domains,” In Proceedings of the Association for Computational Linguistics (ACL) 2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, 8 Oct. 2000, Hong Kong University of Science and Technology, pp. 1-12 (October 2000) (cited hereafter as “Turcato et al. (2000)”), which is incorporated herein by reference. The rest of the description assumes this configuration, except where stated otherwise.

FIG. 21 and FIG. 22 are simplified block diagrams of the synonym processor 2110 and 2210 in two less favoured configurations. FIG. 21 is a simplified block diagram of the synonym processor 2110 containing just the synonym pruner 2250. FIG. 22 is a simplified block diagram of the synonym processor 2210 containing just the synonym optimizer 2280.

2.3.10.1. Synonym Pruner

FIG. 23, FIG. 24, and FIG. 25 are simplified block diagrams of the synonym pruner 2315, 2415, and 2515 in various configurations. The synonym pruner 2315, 2415, and 2515 takes as input a synonym resource 2310, 2410, and 2510 such as WordNet, a machine-readable dictionary, or some other linguistic resource. The synonym pruner 2315, 2415, and 2515 produces those synonymy relations required for a particular domain (e.g., medical reports, aviation incident reports). Those synonymy relations are stored in a pruned synonym resource 2320, 2420, and 2520.

The synonym resource 2310, 2410, and 2510 is incrementally pruned in three phases, or certain combinations of those phases. In the first two phases, two different sets of ranking criteria are applied. These sets of ranking criteria are known as “manual ranking” 2325, 2425, and 2525 and “automatic ranking” 2345, 2445, and 2545. In the third phase, a threshold is set and applied. This phase is known as “synonym filtering” 2355, 2455, and 2555.

FIG. 23 shows the preferred embodiment in which the synonym pruner 2315 comprises manual ranking 2325, automatic ranking 2345, and synonym filtering 2355. This is the configuration used by Turcato et al. (2000), cited above. The rest of the description assumes this configuration, except where stated otherwise.

FIG. 24 and FIG. 25 are simplified block diagrams of the synonym pruner 2415 and 2515 in two less favoured configurations. FIG. 24 is a simplified block diagram of the synonym pruner 2415 containing just manual ranking 2425 and synonym filtering 2455. FIG. 25 is a simplified block diagram of the synonym pruner 2515 containing just automatic ranking 2545 and synonym filtering 2555.

A variant of FIG. 25 is FIG. 25a, in which the automatically ranked synonym resource 2550a produced by automatic ranking 2545a is passed to human evaluation of the domain-appropriateness of synonymy relations 2552a before input to synonym filtering 2555a.

The manual ranking process 2325 consists of automatic ranking of synonymy relations in terms of their likelihood of use in the specific domain 2330, followed by evaluation of the domain-appropriateness of synonymy relations by human evaluators 2335.

The automatic ranking of synonymy relations 2330 assigns a “weight” to each synonymy relation. Each weight is a function of the actual or expected frequency of use of a synonym term in a particular domain, with respect to a particular sense of a first synonym term. For example, Table 7 shows weights assigned to synonymy relations in the aviation domain between the precipitation sense of snow and its synonym terms cocaine, cocain, coke, and C.

TABLE 7
Synonymy relation between precipitation
sense of snow and a synonym term          Weight
cocaine                                   1
cocain                                    0
coke                                      8
C                                         9168

One possible method and system (of many possible methods and systems) for the automatic ranking of synonymy relations 2330 that may be used with the present invention is described in section 2.2.1 of Turcato et al. (2000). Where no inventory of relevant prior queries exists for the domain, the ranking may be simply in terms of domain corpus frequency. Where an inventory of relevant prior queries exists, the ranking uses the frequency of occurrence of the term in the domain corpus and the inventory of query terms to estimate how often a given synonymy relation is likely to be used.
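
For illustration only (this is not the procedure of Turcato et al. (2000)), a simple corpus-frequency weighting of synonymy relations could look like the sketch below, assuming no inventory of prior queries is available.

    from collections import Counter

    def weight_relations(relations, domain_corpus_tokens):
        """relations: list of (word_sense, synonym_term) pairs.
        Returns the relations with weights, ranked from greatest weight to least."""
        freq = Counter(token.lower() for token in domain_corpus_tokens)
        weighted = [(sense, term, freq[term.lower()]) for sense, term in relations]
        return sorted(weighted, key=lambda triple: triple[2], reverse=True)

    relations = [("snow (precipitation)", "cocaine"),
                 ("snow (precipitation)", "cocain"),
                 ("snow (precipitation)", "coke"),
                 ("snow (precipitation)", "C")]
    corpus = "light snow and freezing rain were reported before the C check".split()
    for sense, term, weight in weight_relations(relations, corpus):
        print(sense, term, weight)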

The set of synonymy relations and their weights is then ranked from greatest weight to least, and presented in that ranked order to human evaluators for assessment of their domain-appropriateness 2335. The weights are useful if there are insufficient evaluators to assess all the synonymy relations, as is frequently the case with large synonym resources 2310. In such cases, evaluators begin with the synonymy relations with the greatest weights and proceed down the rank-ordered list, assessing as many synonymy relations as they can with the resources they have available.

The judgement of the appropriateness of a synonymy relation in a domain might be a rating in terms of a binary Yes-No or any other rating scheme the evaluators see fit to use (e.g., a range of appropriateness judgements).

The output of manual ranking 2325 is a manually ranked synonym resource 2340. The manually ranked synonym resource 2340 is like the synonym resource 2310, except that the synonymy relations have been ranked in terms of their relevance to a specific application domain. No synonymy relations are removed during this phase.

In the second phase of the preferred embodiment shown in FIG. 23, the manually ranked synonym resource 2340 is automatically ranked 2345. Automatic ranking 2345 is based on producing scores representing the domain-appropriateness of synonymy relations. The scores are produced from the frequencies, in a domain-specific corpus, of the words involved in the synonymy relation, and the frequencies of other semantically related words. The words involved in the synonymy relation are presently, but need not be limited to, terms from the lists of synonyms and dictionary definitions for words. Other semantically related words include, but need not be limited to, superordinate and subordinate terms for words.

One possible method and system (of many possible methods and systems) for the automatic ranking of the domain-appropriateness of synonymy relations 2345 that may be used with the present invention is described in section 2.3 of Turcato et al. (2000).

The output of automatic ranking 2345 is an automatically ranked synonym resource 2350 of the same sort as the manually ranked synonym resource 2340, with the ranking scores attached to synonymy relations. Again, no synonymy relations are removed during this phase.

In synonym filtering 2355, a threshold is set 2360 and applied 2365 to the automatically ranked synonym resource 2350, producing a filtered synonym resource 2370. It is during this synonym filtering phase 2355 that synonymy relations are removed.

The threshold setting 2360 in the preferred embodiment is flexible and set by the user through a user interface 2305, though neither needs to be the case. For example, the threshold could be fixed and set by the system developer, or the threshold could be flexible and set by the system developer.

The three phases just described can be configured in ways other than the preferred embodiment just described. First, strictly speaking, automatic ranking 2345 could be performed manually, though it would require many person-hours on a synonym resource 2310 of any size. Second, in the preferred embodiment, the pruned synonym resource 2320 is the result of applying two rounds of ranking. However, in principle, the pruned synonym resource 2320 could be the result of just one round of ranking: either just manual ranking 2425 as shown in FIG. 24 or just automatic ranking 2545 as shown in FIG. 25.

2.3.10.2. Synonym Optimizer

FIG. 26, FIG. 27, and FIG. 28 are simplified block diagrams of the synonym optimizer 2610, 2710, and 2810 in various configurations. Input to the synonym optimizer 2610, 2710, and 2810 is either an unprocessed synonym resource 2620, 2720, and 2820 or a pruned synonym resource 2630, 2730, and 2830. The input is a pruned synonym resource 2630, 2730, and 2830 in the preferred embodiment of the synonym processor (shown in FIG. 20). The input is an unprocessed synonym resource 2620, 2720, and 2820 for one of the other two configurations of the synonym processor (shown in FIG. 22).

Output is an optimized synonym resource 2650, 2750, and 2850.

The synonym optimizer 2610, 2710, and 2810 identifies synonymy relations that can be removed: relations that, if absent, either do not affect or only minimally affect the behaviour of the system in a specific domain. It consists of two phases that can be used either together or individually. One of these phases is the removal of irrelevant synonymy relations 2660 and 2760; the other is the removal of redundant synonymy relations 2670 and 2870.

FIG. 26 shows the preferred embodiment in which the synonym optimizer 2610 comprises both the removal of irrelevant synonymy relations 2660 and the removal of redundant synonymy relations 2670. This is the configuration used by Turcato et al. (2000). The rest of the description assumes this configuration, except where stated otherwise.

FIG. 27 and FIG. 28 are simplified block diagrams of the synonym optimizer 2710 and 2810 in two less favoured configurations. FIG. 27 is a simplified block diagram of the synonym optimizer 2710 containing just the removal of irrelevant synonymy relations 2760. FIG. 28 is a simplified block diagram of the synonym optimizer 2810 containing just the removal of redundant synonymy relations 2870.

The removal of irrelevant synonymy relations 2660 eliminates synonymy relations that, if absent, either do not affect or minimally affect the behaviour of the system in a particular domain. One criterion for the removal of irrelevant synonymy relations 2660 is that a synonymy relation contains a synonym term that has zero actual or expected frequency of use in a particular domain with respect to a particular sense of a first synonym term. For example, Table 7 shows weights assigned in the aviation domain for synonymy relations between the precipitation sense of snow and its synonym terms cocaine, cocain, coke, and C. The table shows that the synonym term cocain has weight 0, meaning that cocain has zero actual or expected frequency of use as a synonym of the precipitation sense of snow in the aviation domain. In other words, the synonymy relation (precipitation sense of snow, cocain) can be removed in the domain of aviation.

Note that the criterion for removing a synonym term need not be zero actual or expected frequency of use. When synonym resources are very large, an optimal threshold on actual or expected frequency of use might be one or some other integer. In such cases, there is a trade-off: the higher the integer used, the greater the number of synonymy relations removed (with corresponding increases in efficiency), but the greater the risk of a removed term showing up when the system is actually used.

In most cases, users will accept that irrelevant synonym terms are those with zero actual or expected frequency of use. However, the user interface 2640 allows users to set their own threshold for actual or expected frequency of use, should they want to.

A possible method and system (of many possible methods and systems) for the removal of irrelevant synonymy relations 2660 that may be used with the present invention is described in section 2.4.1 of Turcato et al. (2000). In particular, terms which never appear in the domain corpus are considered to be irrelevant. If the domain corpus is sufficiently large, then terms which appear at a low frequency may still be considered to be irrelevant.
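
A minimal sketch of this irrelevance criterion, using the aviation-domain weights of Table 7 and a user-settable frequency threshold (zero by default), might look as follows.

    def remove_irrelevant(weighted_relations, threshold=0):
        """Keep only synonymy relations whose synonym term has a domain
        frequency (weight) above the threshold."""
        return [(sense, term, weight)
                for (sense, term, weight) in weighted_relations
                if weight > threshold]

    aviation = [("snow (precipitation)", "cocaine", 1),
                ("snow (precipitation)", "cocain", 0),
                ("snow (precipitation)", "coke", 8),
                ("snow (precipitation)", "C", 9168)]
    print(remove_irrelevant(aviation))                # drops (snow, cocain)
    print(remove_irrelevant(aviation, threshold=1))   # also drops (snow, cocaine)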

The removal of redundant synonymy relations 2670 eliminates redundancies among the remaining synonymy relations. Synonymy relations that are removed in this phase are again those that can be removed without affecting the behaviour of the system.

A possible method and system (of many possible methods and systems) for the removal of redundant synonymy relations 2670 that may be used with the present invention is described in section 2.4.2 of Turcato et al. (2000). In particular, sets of synonyms which contain a single term (namely the target term itself) are removed, as are sets of synonyms which are duplicates, namely those identical to another set of synonyms in the resource which has not been removed.
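
Those two redundancy checks can be sketched as below; representing the resource as a mapping from each target term to its set of synonyms is an illustrative assumption.

    def remove_redundant(synonym_sets):
        """Drop synonym sets that contain only the target term itself, and
        synonym sets that duplicate a set that has already been kept."""
        kept, seen = {}, set()
        for target, synonyms in synonym_sets.items():
            if set(synonyms) <= {target}:         # single term: the target itself
                continue
            key = frozenset(synonyms)
            if key in seen:                       # duplicate of a set already kept
                continue
            seen.add(key)
            kept[target] = set(synonyms)
        return kept

    sets = {"snow": {"snow", "snowfall"},
            "snowfall": {"snowfall", "snow"},     # duplicate of the set kept for "snow"
            "aileron": {"aileron"}}               # contains only the target term
    print(remove_redundant(sets))                 # {'snow': {'snow', 'snowfall'}}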

The output of optimization 2610 is an optimized synonym resource 2650, which is of the same sort as the unprocessed synonym resource 2620 and the pruned synonym resource 2630, except that synonymy relations that are irrelevant or redundant in a specific application domain have been removed.

Note that optimization 2610 could be used if the only synonym resource to be filtered 2355 was the manually ranked synonym resource 2340 produced by manual ranking 2325 within synonym pruning 2315. Indeed, optimization 2610 would be virtually essential if manual ranking 2325 and filtering 2355 were the only synonym pruning being performed. Optimization 2610 could also, in principle, be performed between manual ranking 2325 and automatic ranking 2345, but little is gained from this because irrelevant or redundant synonymy relations in the manually ranked synonym resource 2340 do not affect automatic ranking 2345.

2.3.11. CSL Processor

FIG. 29 is a simplified block diagram of the CSL processor 2910. The CSL processor 2910, accessed by the user interface 2905, comprises a CSL Concept and Concept Rule learner 2985 and a CSL query checker 2990.

2.3.12. CSL Concept and Concept Rule Learner

FIG. 30 and FIG. 31 are simplified block diagrams of the CSL Concept and Concept Rule learner 3075 and 3175.

The CSL Concept and Concept Rule learner 3075 and 3175, accessed by the user interface 3005 and 3105, takes as input a text corpus in which instances of a given concept have been highlighted 3045 and 3135, and it outputs a list of CSL Concepts and Concept Rules 3070 and 3195 covering the occurrences marked up in the input corpus. The CSL Concept and Concept Rule learner 3075 and 3175 comprises two main internal methods: highlighted linguistically annotated documents 3050 and 3150 are passed to a CSL Rule creator 3055 and 3160, which produces CSL Rules 3060 and 3165. These CSL Rules 3060 and 3165 are input to a method that creates CSL Concepts from Concept Rules 3065 and 3170, which outputs a list of CSL Concepts and Concept Rules 3070 and 3195.

FIG. 30 and FIG. 31 present two different ways that instances of concepts may be highlighted. In FIG. 30, the CSL Concept and Concept Rule learner 3075 comprises first of all the highlighting of instances of concepts 3045 in the text of linguistically annotated documents 3015 to produce highlighted linguistically annotated documents 3050. Those linguistically annotated documents 3015 may be converted to TML 3020 (or some other format) and may also be stored 3025. Those highlighted linguistically annotated documents 3050 may also be converted to TML 3020 (or some other format) and may also be stored 3085.

In FIG. 31, the CSL Concept and Concept Rule learner 3175 comprises first of all the highlighting of instances of concepts 3135 in the text of documents 3115 to produce highlighted text documents 3125. The linguistic annotator 3145 processes those highlighted documents 3125 to produce highlighted linguistically annotated documents 3150. Those highlighted text documents 3125 may be converted to TML 3120 (or some other format) and may also be stored 3140. The highlighted linguistically annotated documents 3150 may also be converted to TML 3120 (or some other format) and may also be stored 3155.

2.3.13. CSL Rule Creator

FIG. 32 is a simplified block diagram of the CSL Rule creator 3245. The CSL Rule creator 3245 takes as input CSL vocabulary specifications 3230 and highlighted linguistically annotated documents 3215 and outputs CSL Rules 3240. The CSL vocabulary specifications 3230 and highlighted linguistically annotated documents 3215 are matched together using the Concept identifier 3250. Then linguistic variants are defined 3255, synonyms are added from a processed synonym resource 3210 (if available), and parts of speech 3260 are also added before CSL Rules 3240 are produced.

2.3.14. CSL Query Checker

FIG. 33 is a simplified block diagram of the CSL query checker 3355. The CSL query checker 3355, accessed by the user interface 3305, takes as input a proposed CSL query 3315 and, if all queries are known in advance 3360, passes that query (a concept label) 3330a to the retriever 3350. The retriever 3350 is part of the text document retriever 3545 (see FIG. 35).

If all queries are not known in advance 3360, the CSL query checker 3355 matches 3365 the proposed query 3315a against known CSL Concepts and Concept Rules 3325. If a match is found 3370, the query (a CSL expression) 3330b is parsed 3320 and passed to the retriever 3350; otherwise, the query (also a CSL expression) 3330b is parsed 3320 and added to the list of CSL Concepts and Concept Rules to be annotated 3335, which are then passed to the annotator 3345.

2.3.15. CSL Parser

FIG. 34 is a simplified block diagram of the CSL parser 3425. The CSL parser 3425 takes as input a CSL query 3430 (3330b from FIG. 33), CSL Concepts and Rules 3410, and a processed synonym resource 3405, if available. It outputs CSL Concepts and Rules for annotation 3415 and also outputs parsed CSL queries for retrieval 3420. In the output, information left implicit in the input CSL Rules (e.g., about possible tags and synonyms) is made explicit.

The CSL parser 3425 comprises word compilation 3435, CSL Concept compilation 3440, downward synonym propagation 3445, and upward synonym propagation 3450.

Concepts are parsed as follows. Word synonyms (from the processed synonym resource 3405) are propagated throughout the tag hierarchy. This lets the input describe a word as a noun and have its synonyms automatically defined for both singular and plural nouns, assuming that the tag hierarchy contains a tag for noun with child tags for singular and plural nouns. The levels above noun would also automatically contain the synonyms for noun, but they would be marked such that the words would only match if they are tagged as nouns.

A processed synonym resource 3405 (derived from, e.g., WordNet) provides synonyms for words at levels in the tag hierarchy that are referenced. (Synonyms are only propagated to levels in the tag hierarchy that are referenced.)

In word compilation 3435, each word is compiled into a suitable structure.

In Concept compilation 3440, each Concept is compiled into a suitable structure. For every Concept, a word structure is added for each reference to an undefined word.

In downward synonym propagation 3445, for every word, synonyms are propagated down the tag hierarchy. Synonyms that move down take on the tag value given by the position in the hierarchy.

In upward synonym propagation 3450, for every word, synonyms are propagated up the tag hierarchy. Synonyms that move up take on a combination of the tag values given by the positions that they came from in the hierarchy.
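
The downward direction of this propagation can be sketched on a toy tag hierarchy (a noun tag with child tags for singular and plural nouns); the tag names and data layout are assumptions for illustration, and the upward direction would combine tag values analogously.

    tag_children = {"noun": ["noun_sg", "noun_pl"]}        # toy tag hierarchy

    def propagate_down(entries, tag_children):
        """entries maps (word, tag) -> set of synonyms.  Synonyms declared at a
        tag are copied down to its child tags, taking on each child tag's value."""
        result = {key: set(value) for key, value in entries.items()}
        for (word, tag), synonyms in entries.items():
            for child in tag_children.get(tag, []):
                result.setdefault((word, child), set()).update(synonyms)
        return result

    entries = {("snow", "noun"): {"snowfall", "snowflake"}}
    for (word, tag), synonyms in sorted(propagate_down(entries, tag_children).items()):
        print(word, tag, sorted(synonyms))
    # snow noun    ['snowfall', 'snowflake']
    # snow noun_pl ['snowfall', 'snowflake']
    # snow noun_sg ['snowfall', 'snowflake']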

2.3.16. Text Document Retriever

FIG. 35 is a simplified block diagram of the text document retriever 3545. The text document retriever 3545, accessed by the user interface 3505, comprises a retriever 3550. The retriever 3550 takes as input annotated documents 3520 (that is, conceptually annotated documents) and a CSL query for retrieval 3515.

The annotated documents 3520 contain the names of Concepts that were matched by the Concept identifier during conceptual annotation (940 and 935 respectively in FIG. 9). If the Concept identifier was index-based, the annotated documents 3520 also contain inverted index information. The retriever 3550 searches through the annotated documents 3520 for the names of Concepts that match specific CSL expressions in a CSL query for retrieval 3515.

The retriever 3550 produces retrieved documents 3555 and categorized documents 3530. Categorized documents 3530 used for retrieval 3550 may have been stored 3535 and may have been converted from TML 3510 or some other format. Similarly, categorized documents 3530 discovered during retrieval 3550 may be stored 3535 and may have been converted from TML 3510 or some other format. Retrieved documents 3555 have the same storage and conversion possibilities (3510 and 3540).

The retriever 3550 passes retrieved documents 3555 and categorized documents 3530 to a document viewer 3560 accessed by the user interface 3505.

3. Concept Specification Language

This section contains a description of the syntax of CSL. CSL is a language for expressing linguistically-based patterns. It comprises tag hierarchies, Concepts, Concept Rules, Patterns, Operators, and macros.

A tag hierarchy is a set of declarations. Each declaration relates a tag to a set of tags, declaring that each of the latter tags is to be considered an instance of the former tag.

A Concept in CSL is used to represent concepts. A Concept can be either global or internal to other Concepts. A Concept uses words and other Concepts in the definition of Concept Rules.

A Concept Rule comprises an optional name, internal to the Concept, followed by a Pattern.

A Pattern may match

a) single terms in an annotated text (a “single-term Pattern”), or
b) some configuration in an annotated text (a “configurational Pattern”).

A single-term Pattern may comprise a reference to

a) the name of a word, and optionally,
b) its part of speech tag (a simple lexical tag or phrasal tag), and optionally,
c) synonyms of the word.

A configurational Pattern may consist of the form A Operator B, where the Operator is Boolean.

A configuration is any expression in the notation used to represent syntactic descriptions (for instance, trees or labelled bracketing).

A configurational Pattern may consist of the form A Operator B, where the Operator is of two types:

a) Dominance, and
b) Precedence.

A configurational Pattern may consist of the form A Dominates B, where

a) A is a syntactic constituent (which can be identified by a phrasal tag, though not necessarily);
b) B is any Pattern
(the entire Pattern matches any configuration where what B refers to is a subconstituent of A).

A configurational Pattern A Dominates B may be “wide-matched,” meaning that the interval of A in a text is returned instead of that of B; that is, the interval of the dominant expression (A) is returned rather than the interval of the dominated expression (B). The term “interval” was defined earlier as referring to a consecutive sequence of words in a text. An interval can be represented in various ways, for instance, as two integers separated by a dash, where the first integer is the start position and the second integer is the end position. For example, in the Text Tree in FIG. 11, cat occupies 1-1, and the cat occupies 0-1.

A configurational Pattern may consist of the form A Precedes B, where

a) A is any Pattern;
b) B is any Pattern
(the entire Pattern matches any configuration where the constituent A refers to is before the constituent B refers to).

Boolean operators can be applied to any Patterns to obtain further Patterns.

Any of the Patterns thus defined is a CSL Expression.

A Pattern is fully recursive (i.e., subject to Patterns satisfying the arguments of the Operators defined above).

A Macro in the CSL represents a Pattern in a compact, parameterized form and can be used wherever a Pattern is used.

1. A method of information retrieval, performed on a computer systemthat matches text in documents and other text-forms against user-defineddescriptions of concepts, comprising: a) identification of linguisticentities in the text of documents and other text-forms; b) annotation ofsaid identified linguistic entities in a text markup language to producelinguistically annotated documents and other text-forms; c) storage ofsaid linguistically annotated documents and other text-forms; d)identification of concepts using linguistic information, where saidconcepts are represented in a concept specification language and saidconcepts occur in one of: 1) said text of documents and other text-formsin which linguistic entities have been identified in step a); or 2) saidlinguistically annotated documents and other text-forms of step b); or3) stored linguistically annotated documents and other text-forms ofstep c); e) annotation of said identified concepts in said text markuplanguage to produce conceptually annotated documents and othertext-forms; f) storage of said conceptually annotated documents andother text-forms; g) defining and learning concept representations ofsaid concept specification language, including: 1) marking up instancesof concepts in the text of documents and other text-forms; 2) creatingnew concept representations in the concept specification language fromsaid marked up instances of concepts; and 3) adding and, if necessary,integrating said new concept representations in the conceptspecification language with pre-existing concept representations in saidlanguage; h) checking user-defined descriptions of concepts representedin said concept specification language; and i) retrieval by matchingsaid user-defined descriptions of concepts against said conceptuallyannotated documents and other text-forms.
 2. The method according toclaim 1 wherein said identification of linguistic entities in the textof documents and other text-forms comprises identification ofmorphological, syntactic, and semantic entities.
 3. The method accordingto claim 2 wherein said identification of linguistic entities in thetext of documents and other text-forms comprises identifying words andphrases, and establishing dependencies between words and phrases.
 4. Themethod according to claim 3 wherein said identification of linguisticentities in the text of documents and other text-forms is accomplishedby a method selected from one or more of: a) preprocessing of text ofdocuments and other text-forms; b) tagging of text of documents andother text-forms; c) parsing of text of documents and other text-forms.5. The method according to claim 4 wherein annotation of said identifiedlinguistic entities in the text of documents and other text-forms islinguistic annotation and produces a representation of linguisticallyannotated documents and other text-forms in a text markup language. 6.The method according to claim 5 wherein said linguistically annotateddocuments and other text-forms are stored.
 7. The method according toclaim 1 wherein in said identification of concepts using linguisticinformation said concepts are represented in a concept specificationlanguage and said concepts occur in one of: a) said text of documentsand other text-forms in which linguistic entities have been identifiedto produce said linguistically annotated documents and other text-formsby means of a method comprising: 1) identification of morphological,syntactic, and semantic entities; 2) identification or words andphrases, and establishment of dependencies between words and phrases;and 3) at least one of: i) preprocessing of text of documents and othertext-forms; ii) tagging of text of documents and other text-forms; iii)parsing of text of documents and other text-forms; or b) saidlinguistically annotated documents and other text-forms wherein saidannotation of said identified linguistic entities in the text ofdocuments and other text-forms comprises linguistic annotation andproduces a representation of linguistically annotated documents andother text-forms in a text markup language; or c) said storedlinguistically annotated documents and other text-forms in a text markuplanguage.
 8. The method according to claim 7 wherein said conceptspecification language allows representations to be defined for conceptsin terms of a linguistics-based pattern or set of patterns, where eachpattern consists of words, phrases, other concepts, and relationshipsbetween words, phrases, and concepts.
 9. The method according to claim 8wherein said identification of concepts using linguistic information,when used with said concept specification language, consists of applyingrepresentations of concepts for the purpose of identifying concepts. 10.The method according to claim 7 wherein annotation of said identifiedconcepts in linguistically annotated documents and other text-forms isconceptual annotation and produces a representation of conceptuallyannotated documents and other text-forms in a text markup language. 11.The method according to claim 10 wherein said conceptually annotateddocuments and other text-forms are stored.
 12. The method according toclaim 7 wherein said identification of concepts uses linguisticinformation, and said concepts are represented in a conceptspecification language, as a result of methods for identifyingcomprising: a) compiling an expression from said concept specificationlanguage into finite state automata (FSAs); b) matching said FSAsagainst linguistic entities in said linguistically annotated text. 13.The method according to claim 12 wherein concepts from said conceptspecification language are compiled into finite state automata (FSAs)and said compilation into FSAs comprises one or both of the following:a) the grammar from the parser used within the method to parselinguistically annotated text; and b) sets of synonyms.
 14. The methodaccording to claim 7 wherein said identification of concepts useslinguistic information, and said concepts are represented in a conceptspecification language, as a result of methods for identifying conceptscomprising recursive descent matching which consists of traversing anexpression in said concept specification language and recursivelymatching constituents of said expression against linguistic entities inlinguistically annotated text.
 15. The method according to claim 14wherein said identification of concepts uses recursive descent matchingand wherein said recursive descent matching comprises sets of synonyms.16. The method according to claim 7 wherein said identification ofconcepts uses linguistic information, and said concepts are representedin a concept specification language, as a result of methods foridentifying concepts which comprise bottom-up matching comprising: a)generating in a bottom-up fashion multiple spans, where each span is: 1)a word or constituent and, optionally, structural information about theword or constituent, or 2) a set of words and constituents that followeach other and, optionally, structural information about the words orword and constituents or constituent; b) generating in a bottom-upfashion spans consumed by single-term patterns in an expression in saidconcept specification language; c) generating in a bottom-up fashionspans consumed by operators in an expression in said conceptspecification language; and d) matching in a bottom-up fashion saidspans against linguistic entities in linguistically annotated text. 17.The method according to claim 16 wherein identification of conceptsusing bottom-up matching, where said bottom-up matching comprises setsof synonyms.
 18. The method according to claim 7 wherein saididentification of concepts uses linguistic information, and saidconcepts are represented in a concept specification language, as aresult of methods for identifying concepts that are index-basedcomprising use of an inverted index, where: a) said inverted indexcontains words, constituents, and tags for linguistic information,comprising syntactic information, from linguistically annotated text; b)said inverted index contains spans for said words, constituents, andtags from linguistically annotated text; and c) where each span is: 1) aword or constituent and, optionally, structural information about theword or constituent, or 2) a set of words and constituents that followeach other and, optionally, structural information about the words orword and constituents or constituent.
 19. The method according to claim18 wherein said identification of concepts uses linguistic information,and said concepts are represented in a concept specification language,as a result of index-based methods for identifying concepts comprisingindex-based matching, where said index-based matching comprises: a)using backtracking to resolve the constraints of operators in anexpression in said concept specification language; b) attachingiterators to all items in the expression in said concept specificationlanguage; c) using the iterators to produce matches of all items in theexpression in said concept specification language against text in theinverted index; d) maintaining a state for the iterator for each item inthe expression in said concept specification language where that stateis used to determine whether or not it has been processed before in thematch of said expression against said inverted index, and also relevantinformation about the progress of the match; e) maintaining a state forthe iterator for each item that is a word in the expression in saidconcept specification language where that state comprises: a list ofapplicable synonyms of the word in question, and the current synonymbeing used for matching; an iterator into the inverted index that canenumerate all instances of the word in said index, and which records thecurrent word; f) during the course of a match, each item in theexpression in said concept specification language is tested, and ifsuccessful, returns a set of spans covering the match of itscorresponding sub-expression (i.e., components of said expression). 20.The method according to claim 19 wherein said identification of conceptsuses index-based matching, where said index-based matching comprisessets of synonyms.
 21. The method according to claim 18 wherein saididentification of concepts uses linguistic information, and saidconcepts are represented in a concept specification language, as aresult of index-based methods for identifying concepts comprisingcandidate checking index-based matching where said candidate checkingindex-based matching comprises identifying sets of candidate spans,where: a) a candidate span is a span that may contain a concept to beidentified; b) any span that is not covered by a candidate span from thesets of candidate spans is one that cannot contain a concept to beidentified; c) each sub-expression of an expression in the conceptspecification language is associated with a procedure; and d) each suchprocedure is used to generate candidate spans or to check whether agiven span is a candidate span.
 22. The method according to claim 21wherein said identification of concepts uses linguistic information, andsaid concepts are represented in a concept specification language, as aresult of index-based methods for identifying concepts comprisingcandidate checking index-based matching where said candidate checkingindex-based matching produces candidate spans that serve as input toconcept identification methods comprising compiling and matching finitestate automata, recursive descent matching, bottom-up matching, andindex based matching.
 23. The system according to claim 21 wherein saidCSL processor, accessed by said user interface, comprises a CSL Conceptand Concept Rule learner, and a CSL query checker.
 24. The systemaccording to claim 23 wherein said CSL Concept and Concept Rule learnercomprises: a) highlighting instances of Concepts in the text ofdocuments; b) creating new CSL Rules from said highlighted instances ofConcepts; c) creating new CSL Concepts from said CSL Rules; d) addingand, if necessary, integrating said new CSL Concepts and Concept Ruleswith pre-existing CSL Concepts and Concept Rules.
 25. The systemaccording to claim 24 wherein creating new CSL Rules comprises: a) usingthe Concept identifier to match together CSL vocabulary specificationsand highlighted linguistically annotated documents; b) defininglinguistic variants; c) adding synonyms from a set of synonyms; d)adding parts of speech.
 26. The system according to claim 24 whereinsaid CSL Concept and Concept Rule learner comprises means for: a)highlighting instances of Concepts in the text of documents to producehighlighted documents; b) linguistic annotation of said documents toproduce highlighted linguistically annotated documents; c) saidhighlighted text documents can be either produced on demand or stored inTML or other formats; d) said highlighted linguistically annotateddocuments can be either produced on demand or stored in TML or otherformats; e) producing new and CSL Concept Rules from said highlightedinstances of Concepts in said highlighted linguistically annotateddocuments; and f) adding and, if necessary, integrating said new CSLConcepts and Concept Rules with pre-existing CSL Concepts and ConceptRules.
27. The system according to claim 23 wherein said CSL Concept and Concept Rule learner comprises means for: a) highlighting instances of Concepts in the text of linguistically annotated documents to produce highlighted linguistically annotated documents; where b) said linguistically annotated documents can be either produced on demand or stored in TML or other formats; and c) said highlighted linguistically annotated documents can be either produced on demand or stored in TML or other formats; d) producing new CSL Concept Rules from said highlighted instances of Concepts in said highlighted linguistically annotated documents; and e) adding and, if necessary, integrating said new CSL Concepts and Concept Rules with pre-existing CSL Concepts and Concept Rules.
28. The system according to claim 23 wherein said CSL query checker, accessed by said user interface, takes as input a proposed CSL query and, if all queries are known in advance, passes said query to the retriever.
29. The system according to claim 23 wherein said CSL query checker, accessed by said user interface, takes as input a proposed CSL query and, if all queries are not known in advance, matches said query against known CSL Concepts and Concept Rules and, if a match is found, then the query is parsed with a CSL parser and passed to the retriever.
30. The system according to claim 23 wherein said CSL query checker, accessed by said user interface, takes as input a proposed CSL query and, if all queries are not known in advance, matches said query against known CSL Concepts and Concept Rules and, if a match is not found, then the query is parsed with a CSL parser and added to the list of CSL Concepts and Concept Rules to be annotated, which are then passed to the annotator.
31. The method according to claim 7 wherein said identification of concepts uses linguistic information, and said concepts are represented in a concept specification language, as a result of methods for identifying concepts comprising using an inverted index with compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index-based matching.
32. The method according to claim 1 wherein said concept representations to be defined and learned comprise hierarchies, rules, operators, patterns, and macros.
33. The method according to claim 1 wherein creating new concept representations of said concept specification language comprises: a) using concept identification methods to match together concept specification language vocabulary specifications and highlighted linguistically annotated documents and other text-forms; b) defining linguistic variants; c) adding synonyms from a set of synonyms; d) adding parts of speech.
34. The method according to claim 1 further comprising the step of defining and learning said concept representations of said concept specification language comprising: a) highlighting instances of concepts in the text of linguistically annotated documents and other text-forms to produce highlighted linguistically annotated documents and other text-forms; where b) said linguistically annotated documents and other text-forms are stored or produced on demand; and c) said highlighted linguistically annotated documents and other text-forms are stored or produced on demand; d) producing new concept representations in the concept specification language from said highlighted instances of concepts in said highlighted linguistically annotated documents and other text-forms; and e) adding and, if necessary, integrating said new concept representations in the concept specification language with pre-existing concept representations in said language.
35. The method according to claim 1 further comprising the step of defining and learning said concept representations of said concept specification language comprising: a) marking up instances of concepts in the text of documents and other text-forms to produce highlighted documents and other text-forms; b) identification of linguistic entities in said highlighted documents and other text-forms and annotation of said documents and other text-forms to produce highlighted linguistically annotated documents and other text-forms; c) said highlighted text documents and other text-forms are stored or produced on demand; d) said highlighted linguistically annotated documents and other text-forms are stored or produced on demand; e) producing new concept representations in the concept specification language from said highlighted instances of concepts in said highlighted linguistically annotated documents and other text-forms; and f) adding and, if necessary, integrating said new concept representations in the concept specification language with pre-existing concept representations in said language.
36. The method according to claim 1 wherein said user-defined descriptions of concepts represented in said concept specification language comprise user queries to an information retrieval system, said user queries being represented in said concept specification language.
37. The method according to claim 36 wherein, if all known queries are represented in said concept specification language, then a proposed query represented in said concept specification language is subsequently used by said retrieval method.
38. The method according to claim 36 wherein, if all queries are not known in advance to be represented in said concept specification language, then a proposed query represented in said concept specification language is matched against a pre-stored repository of queries represented in said concept specification language and, if a match is found, then the query is subsequently used by said method of retrieval.
39. The method according to claim 36 wherein, if all queries are not known in advance to be represented in said concept specification language, then a proposed query represented in said concept specification language is matched against a pre-stored repository of queries represented in said concept specification language and, if a match is not found, then the query is subsequently used by said method of conceptual annotation.
40. The method according to claim 36 wherein retrieval matches said user-defined descriptions against said annotated text and retrieves matching documents and other text-forms.
41. The method according to claim 1 comprising: a) said annotation of said identified linguistic entities in a text markup language to produce linguistically annotated documents and other text-forms comprises annotation of said identified linguistic entities in a Text Markup Language (TML) to produce linguistically annotated documents and other text-forms; b) said identification of concepts using linguistic information comprises identification of Concepts and Concept Rules using linguistic information, where said Concepts and Concept Rules are represented in a Concept Specification Language (CSL) and said Concepts-to-be-identified and Concept Rules-to-be-identified occur in one of: 1) said text of documents and other text-forms in which linguistic entities have been identified; 2) said linguistically annotated documents and other text-forms; or 3) said stored linguistically annotated documents and other text-forms; c) said annotation of said identified concepts in said text markup language to produce conceptually annotated documents and other text-forms comprises annotation of said identified Concepts and Concept Rules in said TML to produce conceptually annotated documents and other text-forms; d) defining and learning CSL Concepts and Concept Rules; e) said checking user-defined descriptions of concepts represented in said concept specification language comprises checking user-defined descriptions of Concepts and Concept Rules represented in CSL; and f) said retrieval by matching said user-defined descriptions of concepts against said conceptually annotated documents and other text-forms comprises retrieval by matching said user-defined descriptions of CSL Concepts and Concept Rules against said conceptually annotated documents and other text-forms.
42. A system for implementing said method according to claim 41 comprising one of: a) a server, comprising a communications interface to one or more clients over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising a module or submodules for an information retriever, and one or more output devices; and b) one or more clients, comprising a communications interface to a server over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising one or more submodules for an information retriever, and one or more output devices.
43. The system of claim 42 wherein the information retriever takes as input text in documents and other text-forms in the form of a signal from one or more input devices to a user interface, and carries out predetermined information retrieval processes to produce a collection of text in documents and other text-forms, which are output from the user interface in the form of a signal to one or more output devices.
44. The system according to claim 43 wherein predetermined information retrieval processes, accessed by said user interface, comprise a text document annotator, CSL processor, CSL parser, and text document retriever.
45. The system according to claim 44 wherein said text document annotator, accessed by said user interface, comprises a document loader from a document database, which passes text documents to the annotator, and outputs one or more annotated documents.
46. The system according to claim 45 wherein said annotator takes as input one or more text documents, outputs one or more annotated documents, and is comprised of a linguistic annotator which passes linguistically annotated documents to a conceptual annotator.
47. The system according to claim 46 wherein said linguistically annotated documents are annotated with a representation in a Text Markup Language.
48. The system according to claim 46 wherein said Text Markup Language (TML) has the syntax of XML, and conversion to and from TML is accomplished with an XML converter.
49. The system according to claim 46 wherein said linguistic annotator, taking as input one or more text documents, and outputting one or more linguistically annotated documents, comprises one or more of the following: a) a preprocessor; b) a tagger; and c) a parser.
50. The system according to claim 49 wherein said preprocessor, taking as input one or more text documents or the documents output by any other appropriate linguistic identification process, and producing as output one or more preprocessed documents, comprises means for one or more of the following: a) breaking text into words; b) marking phrase boundaries; c) identifying numbers, symbols, and other punctuation; d) expanding abbreviations; and e) splitting apart contractions.
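A minimal, hypothetical preprocessor along the lines enumerated in the claim might look as follows (real preprocessors are considerably richer; the abbreviation and contraction tables here are illustrative only):

    # Illustrative preprocessor sketch covering the claimed steps: word
    # breaking, boundary marking, number and punctuation identification,
    # abbreviation expansion, and contraction splitting (names assumed).

    import re
    from typing import List, Tuple

    ABBREVIATIONS = {"St.": "Street", "Dr.": "Doctor"}        # assumed examples
    CONTRACTIONS = {"can't": "can not", "won't": "will not",
                    "n't": " not", "'s": " is"}


    def preprocess(text: str) -> List[Tuple[str, str]]:
        """Return a list of (token, kind) pairs, where kind is one of
        WORD, NUMBER, PUNCT, or BOUNDARY (a sentence/phrase boundary)."""
        for abbr, full in ABBREVIATIONS.items():
            text = text.replace(abbr, full)                   # expand abbreviations
        for contr, full in CONTRACTIONS.items():
            text = text.replace(contr, full)                  # split contractions
        tokens: List[Tuple[str, str]] = []
        for tok in re.findall(r"\w+|[^\w\s]", text):          # break into words
            if tok in ".!?;":
                tokens.append((tok, "PUNCT"))
                tokens.append(("", "BOUNDARY"))               # mark boundary
            elif re.fullmatch(r"\d+(\.\d+)?", tok):
                tokens.append((tok, "NUMBER"))                # identify numbers
            elif re.fullmatch(r"\W", tok):
                tokens.append((tok, "PUNCT"))                 # other punctuation
            else:
                tokens.append((tok, "WORD"))
        return tokens


    if __name__ == "__main__":
        print(preprocess("The car can't stop on Main St. It hit 2 poles."))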
51. The system according to claim 49 wherein said tagger takes as input a set of tags, one or more preprocessed documents or the documents output by any other appropriate linguistic identification process, and produces as output one or more documents tagged with the appropriate part of speech from a given tagset.
52. The system according to claim 49 wherein said parser takes as input one or more tagged documents or the documents output by any other appropriate linguistic identification process and produces as output one or more parsed documents.
53. The system according to claim 46 wherein said conceptually annotated documents are annotated with a representation in TML.
54. The system according to claim 53 wherein said conceptually annotated documents are stored.
55. The system according to claim 46 wherein said input of one or more linguistically annotated documents to said conceptual annotator comprises at least one of the following sources: a) the linguistic annotator directly; b) storage in some linguistically annotated form such as the representation produced by the final linguistic identification process of the linguistic annotator; and c) storage in TML followed by conversion from TML to the representation produced by the final linguistic identification process of the linguistic annotator.
56. The system according to claim 46 wherein said conceptual annotator comprises a Concept identifier.
57. The system according to claim 56 wherein said Concept identifier produces conceptually annotated documents as a result of: a) compiling CSL into finite state automata (FSAs); b) matching said FSAs against linguistically annotated documents.
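For illustration, assuming a drastically simplified pattern language (a sequence of term sets, each set standing for a word together with its synonyms), compilation into a finite state automaton and matching against a token stream might be sketched as follows; the CSL compiler of the claims handles far richer patterns:

    # Hypothetical sketch: compile a simple sequence pattern into FSA
    # transitions and match it against a token stream (names assumed).

    from typing import Dict, List, Set, Tuple

    Span = Tuple[int, int]


    def compile_sequence(pattern: List[Set[str]]) -> Dict[int, Dict[str, int]]:
        """Compile a sequence of term sets (each set = a word plus synonyms)
        into FSA transitions: state -> {term -> next state}."""
        transitions: Dict[int, Dict[str, int]] = {}
        for state, terms in enumerate(pattern):
            transitions[state] = {term: state + 1 for term in terms}
        return transitions


    def match_fsa(transitions: Dict[int, Dict[str, int]],
                  final_state: int, tokens: List[str]) -> List[Span]:
        """Run the FSA from every start position and record spans that reach
        the final (accepting) state."""
        spans: List[Span] = []
        for start in range(len(tokens)):
            state = 0
            for i in range(start, len(tokens)):
                state = transitions.get(state, {}).get(tokens[i], -1)
                if state == -1:
                    break
                if state == final_state:
                    spans.append((start, i))
                    break
        return spans


    if __name__ == "__main__":
        # Pattern roughly corresponding to "(car|vehicle) (stolen|taken)".
        pattern = [{"car", "vehicle"}, {"stolen", "taken"}]
        fsa = compile_sequence(pattern)
        tokens = "report of vehicle stolen overnight".split()
        print(match_fsa(fsa, len(pattern), tokens))   # [(2, 3)]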
58. The system according to claim 57 wherein said compilation into FSAs also includes as part of compilation one or both of the following: a) the grammar from the parser used by the system to parse linguistically annotated documents; and b) sets of synonyms.
59. The system according to claim 56 wherein said Concept identifier produces conceptually annotated documents as a result of recursive descent matching which consists of traversing an expression in CSL and recursively matching constituents of said expression against linguistic entities in linguistically annotated text.
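An illustrative recursive descent matcher, using an assumed nested-tuple representation of CSL-like expressions, can be sketched in a few lines; it walks the expression tree and recursively matches each constituent against the tokens of the annotated text:

    # Illustrative recursive descent matcher over a hypothetical
    # nested-tuple encoding of CSL-like expressions.

    from typing import List, Tuple

    Span = Tuple[int, int]


    def match(expr, tokens: List[str]) -> List[Span]:
        """Return every span of `tokens` matched by the expression."""
        op = expr[0]
        if op == "word":                       # single-term pattern
            return [(i, i) for i, t in enumerate(tokens) if t == expr[1]]
        if op == "or":                         # Boolean operator
            return match(expr[1], tokens) + match(expr[2], tokens)
        if op == "precedes":                   # configurational operator
            return [(a0, b1)
                    for a0, a1 in match(expr[1], tokens)
                    for b0, b1 in match(expr[2], tokens)
                    if a1 < b0]
        raise ValueError(f"unknown operator {op!r}")


    if __name__ == "__main__":
        expr = ("precedes",
                ("or", ("word", "car"), ("word", "vehicle")),
                ("word", "stolen"))
        tokens = "the vehicle was reported stolen".split()
        print(match(expr, tokens))             # [(1, 4)]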
60. The system according to claim 59 wherein said recursive descent matching comprises sets of synonyms.
61. The system according to claim 56 wherein said Concept identifier produces conceptually annotated documents as a result of bottom-up matching which comprises: a) generating in a bottom-up fashion multiple spans, where each span is: 1) a word or constituent and, optionally, structural information about the word or constituent, or 2) a set of words and constituents that follow each other and, optionally, structural information about the words and constituents; b) generating in a bottom-up fashion spans consumed by single-term patterns in an expression in CSL; c) generating in a bottom-up fashion spans consumed by operators in an expression in CSL; and d) matching in a bottom-up fashion said spans against linguistic entities in linguistically annotated documents.
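By contrast with the recursive descent sketch above, a bottom-up matcher first generates the spans consumed by single-term patterns and then combines them upward through the operators; a minimal, hypothetical sketch:

    # Hypothetical bottom-up matching sketch: spans are generated first for
    # single terms and then combined upward by operators.

    from typing import Dict, List, Tuple

    Span = Tuple[int, int]


    def spans_for_terms(tokens: List[str],
                        patterns: Dict[str, set]) -> Dict[str, List[Span]]:
        """Level 1: spans consumed by each single-term pattern (a name plus
        the set of words/synonyms it accepts)."""
        return {name: [(i, i) for i, t in enumerate(tokens) if t in words]
                for name, words in patterns.items()}


    def combine_precedes(left: List[Span], right: List[Span]) -> List[Span]:
        """Level 2: spans consumed by a Precedes operator, built from the
        spans of its two operands."""
        return [(l0, r1) for l0, l1 in left for r0, r1 in right if l1 < r0]


    if __name__ == "__main__":
        tokens = "the vehicle was reported stolen".split()
        term_spans = spans_for_terms(tokens, {
            "VEHICLE": {"car", "vehicle"},
            "THEFT": {"stolen", "taken"},
        })
        # Operator level: VEHICLE precedes THEFT.
        print(combine_precedes(term_spans["VEHICLE"], term_spans["THEFT"]))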
62. The system according to claim 61 wherein said bottom-up matching comprises sets of synonyms.
63. The system according to claim 61 wherein said Concept identifier using index-based methods produces conceptually annotated documents as a result of index-based matching, where said index-based matching comprises: a) using backtracking to resolve the constraints of CSL operators in an expression in CSL; b) attaching iterators to all items in the CSL expression; c) using the iterators to produce matches of all items in the CSL expression against text in the inverted index; d) maintaining a state for the iterator for each item in the CSL expression where that state is used to determine whether or not it has been processed before in the match of said expression against said inverted index, and also relevant information about the progress of the match; e) maintaining a state for the iterator for each item that is a word in the expression in CSL where that state comprises the following information: a list of applicable synonyms of the word in question, and the current synonym being used for matching; an iterator into the inverted index that can enumerate all instances of the word in said index, and which records the current word; f) during the course of a match, each item in the CSL expression is tested, and if successful, returns a set of spans covering the match of its corresponding sub-expression (i.e., components of said CSL expression).
64. The system according to claim 63 wherein said index-based matching comprises sets of synonyms.
65. The system according to claim 61 wherein said Concept identifier produces conceptually annotated documents as a result of candidate checking index-based matching, where said candidate checking index-based matching comprises identifying sets of candidate spans, where: a) a candidate span is a span that may contain a Concept to be identified (matched); b) any span that is not covered by a candidate span from the sets of candidate spans is one that cannot contain a Concept to be identified (matched); c) each sub-expression of a CSL expression is associated with a procedure; and d) each such procedure is used to generate candidate spans or to check whether a given span is a candidate span.
66. The system according to claim 65 wherein said candidate spans produced by said candidate checking index-based matching serve as input to Concept identification methods comprising compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index-based matching.
67. The system according to claim 61 wherein said Concept identifier produces conceptually annotated documents as a result of methods for identifying Concepts comprising using an inverted index with compiling and matching finite state automata, recursive descent matching, bottom-up matching, and index-based matching.
68. The system according to claim 56 wherein said Concept identifier produces conceptually annotated documents as a result of methods for identifying Concepts that are index-based comprising use of an inverted index, where: a) said inverted index contains words, constituents, and tags for linguistic information from linguistically annotated text; b) said inverted index contains spans for said words, constituents, and tags from linguistically annotated text; and c) where a span is: 1) a word or constituent and, optionally, structural information about the word or constituent, or 2) a set of words and constituents that follow each other and, optionally, structural information about the words and constituents.
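The following sketch (assumed layout, for illustration only) shows the kind of inverted index the claim describes, in which words, part-of-speech tags, and constituent labels from the linguistically annotated text are all keyed to the spans at which they occur:

    # Sketch of an inverted index over linguistically annotated text:
    # words, tags, and constituent labels are mapped to occurrence spans.

    from collections import defaultdict
    from typing import Dict, List, Tuple

    Span = Tuple[int, int]


    def build_index(annotated) -> Dict[str, List[Span]]:
        """`annotated` is a pair: a list of (word, pos_tag) pairs and a list
        of (constituent_label, start, end) triples."""
        words, constituents = annotated
        index: Dict[str, List[Span]] = defaultdict(list)
        for i, (word, tag) in enumerate(words):
            index[word.lower()].append((i, i))       # the word itself
            index[tag].append((i, i))                # its linguistic tag
        for label, start, end in constituents:
            index[label].append((start, end))        # constituent spans
        return index


    if __name__ == "__main__":
        words = [("The", "DT"), ("vehicle", "NN"), ("was", "VBD"),
                 ("stolen", "VBN")]
        constituents = [("NP", 0, 1), ("VP", 2, 3), ("S", 0, 3)]
        index = build_index((words, constituents))
        print(index["vehicle"], index["NN"], index["NP"])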
69. The system according to claim 46 wherein said conceptual annotator takes as input one or more linguistically annotated documents, a list of CSL Concepts and Concept Rules for annotation, and optionally data from a synonym resource, and outputs one or more conceptually annotated documents.
70. The system according to claim 44 wherein said CSL parser takes as input a synonym database, CSL query, and CSL Concepts and Rules, and outputs CSL Concepts and Rules for annotation as a result of the following: a) word compilation; b) Concept compilation; c) downward synonym propagation; and d) upward synonym propagation.
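The claim does not spell out the propagation algorithm; one possible reading, sketched below purely for illustration with assumed names, pushes Concept-level synonyms down to the words of the Concept's Rules and collects word-level synonyms from the synonym database back up to the Concept:

    # Speculative, illustrative reading of downward/upward synonym
    # propagation; not the claimed algorithm.

    from typing import Dict, List, Set


    def propagate(concept_words: Dict[str, List[str]],
                  concept_synonyms: Dict[str, Set[str]],
                  synonym_db: Dict[str, Set[str]]):
        """Return per-word and per-concept synonym sets after one downward
        and one upward propagation pass (concept_synonyms is updated)."""
        word_syns: Dict[str, Set[str]] = {}
        for concept, words in concept_words.items():
            for word in words:
                syns = set(synonym_db.get(word, set()))          # word compilation
                syns |= concept_synonyms.get(concept, set())     # downward pass
                word_syns[word] = syns
        for concept, words in concept_words.items():             # upward pass
            for word in words:
                concept_synonyms.setdefault(concept, set()).update(word_syns[word])
        return word_syns, concept_synonyms


    if __name__ == "__main__":
        words_per_concept = {"THEFT": ["stolen", "taken"]}
        concept_level = {"THEFT": {"robbed"}}
        synonym_db = {"stolen": {"pilfered"}}
        print(propagate(words_per_concept, concept_level, synonym_db))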
71. The system according to claim 44 wherein said text document retriever, accessed by said user interface, comprises a retriever which takes one or more annotated documents as input, passes retrieved and categorized documents to a TML converter, which passes them to a document viewer.
72. The method according to claim 41 wherein said user-defined descriptions of CSL Concepts and Concept Rules comprise user queries to an information retrieval system, said user queries being represented in CSL.
73. The method according to claim 41 wherein a tag hierarchy in the CSL is a set of declarations, each declaration relating a tag to a set of tags, declaring that each of the latter tags is to be considered an instance of the former tag.
74. The method according to claim 41 wherein a Concept in the CSL is used to represent concepts.
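A tag hierarchy of the kind claim 73 describes can be sketched as a mapping from a tag to the set of tags declared to be its instances (the tag names below are assumed for illustration):

    # Minimal sketch of a tag hierarchy: each declaration relates a tag to
    # the set of tags that count as instances of it.

    from typing import Dict, Set

    TAG_HIERARCHY: Dict[str, Set[str]] = {
        "noun": {"NN", "NNS", "NNP", "NNPS"},
        "verb": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    }


    def is_instance(tag: str, general_tag: str) -> bool:
        """True when `tag` is declared an instance of `general_tag`
        (or is the same tag)."""
        return tag == general_tag or tag in TAG_HIERARCHY.get(general_tag, set())


    if __name__ == "__main__":
        print(is_instance("NNS", "noun"))   # True
        print(is_instance("VBD", "noun"))   # False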
75. The method according to claim 74 wherein a Concept in the CSL can either be global or internal to other Concepts.
76. The method according to claim 74 wherein a Concept in the CSL uses words and other Concepts in the definition of Concept Rules.
77. The method according to claim 76 wherein a Concept Rule in the CSL comprises an optional name internal to the Concept followed by a Pattern.
78. The method according to claim 77 wherein a Pattern in the CSL may match: a) single terms in an annotated text (a “single-term Pattern”); or b) some configuration in an annotated text (a “configurational Pattern”).
79. The method according to claim 78 wherein a configurational Pattern in the CSL consists of the form A Operator B, where the Operator is Boolean.
80. The method according to claim 79 wherein a Boolean operator in the CSL can be applied to any Patterns to obtain further Patterns.
81. The method according to claim 78 wherein a configurational Pattern in the CSL is any expression in the notation used to represent syntactic descriptions.
82. The method according to claim 81 wherein a configurational Pattern in the CSL consists of the form A Operator B, where the Operator is of two types: a) Dominance, and b) Precedence.
83. The method according to claim 82 wherein a configurational Pattern in the CSL consists of the form A Dominates B, where: a) A is a syntactic constituent (which can be identified by a phrasal tag, though not necessarily); b) B is any Pattern; and c) the entire Pattern matches any configuration where what B refers to is a subconstituent of A.
84. The method according to claim 83 wherein a configurational Pattern in the CSL of the form A Dominates B is wide-matched, where said wide-matching returns the interval of the dominant expression A in a text instead of the interval of the dominated expression B, and where said interval is a consecutive sequence of words in a text that is commonly though not necessarily represented as two integers separated by a dash.
85. The method according to claim 84 wherein a configurational Pattern in the CSL consists of the form A Precedes B, where: a) A is any Pattern; b) B is any Pattern; and c) the entire Pattern matches any configuration where what A refers to precedes what B refers to in the text.
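For illustration, assuming a toy table of constituent spans, the two configurational operators and wide-matching might be sketched as follows:

    # Hypothetical sketch of Dominates (subconstituency, optionally
    # wide-matched) and Precedes (linear order) over toy parse spans.

    from typing import Dict, List, Tuple

    Span = Tuple[int, int]

    # Toy parse: constituent label -> spans it covers (word offsets).
    CONSTITUENTS: Dict[str, List[Span]] = {
        "NP": [(0, 1)], "VP": [(2, 4)], "S": [(0, 4)],
    }


    def dominates(label: str, inner: List[Span],
                  wide: bool = True) -> List[Span]:
        """Spans where a `label` constituent dominates (contains) a span in
        `inner`; with wide-matching the dominant interval is returned."""
        out: List[Span] = []
        for a0, a1 in CONSTITUENTS.get(label, []):
            for b0, b1 in inner:
                if a0 <= b0 and b1 <= a1:
                    out.append((a0, a1) if wide else (b0, b1))
        return out


    def precedes(left: List[Span], right: List[Span]) -> List[Span]:
        """Spans where some `left` span ends before some `right` span begins."""
        return [(l0, r1) for l0, l1 in left for r0, r1 in right if l1 < r0]


    if __name__ == "__main__":
        stolen = [(4, 4)]                       # span of the word "stolen"
        vehicle = [(1, 1)]                      # span of the word "vehicle"
        print(dominates("VP", stolen))          # [(2, 4)]  (wide-matched)
        print(precedes(vehicle, stolen))        # [(1, 4)]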
86. The method according to claim 78 wherein any of the Patterns defined in the CSL is a CSL Expression.
87. The method according to claim 78 wherein a Pattern defined in the CSL is fully recursive.
88. The method according to claim 78 wherein a Macro in the CSL represents a Pattern in a compact, parameterized form and can be used wherever a Pattern is used.
89. The method according to claim 78 wherein a single-term Pattern in the CSL comprises a reference to: a) the name of a word; b) optionally, its part of speech tag; and c) optionally, synonyms of the word.
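A single-term Pattern of the kind claim 89 describes (word name, optional part-of-speech tag, optional synonyms) can be sketched, with assumed names, as a small matching predicate over tagged tokens:

    # Illustrative sketch of a single-term Pattern matched against
    # (token, part-of-speech) pairs.

    from dataclasses import dataclass, field
    from typing import Set


    @dataclass
    class SingleTermPattern:
        word: str
        tag: str = ""                      # optional part-of-speech constraint
        synonyms: Set[str] = field(default_factory=set)

        def matches(self, token: str, token_tag: str) -> bool:
            word_ok = token.lower() == self.word or token.lower() in self.synonyms
            tag_ok = not self.tag or token_tag == self.tag
            return word_ok and tag_ok


    if __name__ == "__main__":
        pattern = SingleTermPattern("stolen", tag="VBN", synonyms={"taken"})
        tagged = [("vehicle", "NN"), ("taken", "VBN")]
        print([i for i, (w, t) in enumerate(tagged) if pattern.matches(w, t)])  # [1]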
90. The method according to claim 41 wherein said Concepts, represented in said CSL, derive from the sublanguages used to analyze event-based specialized domains comprising insurance claims, business and financial reports, police incident reports, medical reports, and aviation incident reports.
91. A system for implementing said method according to claim 1 comprising one of: a) a server, comprising a communications interface to one or more clients over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising a module or submodules for an information retriever, and one or more output devices; and b) one or more clients, comprising a communications interface to a server over a network or other communication connection, one or more central processing units (CPUs), one or more input devices, one or more program and data storage areas comprising one or more submodules for an information retriever, and one or more output devices.
92. The system of claim 91 wherein the information retriever takes as input text in documents and other text-forms in the form of a signal from one or more input devices to a user interface, and carries out predetermined information retrieval processes to produce a collection of text in documents and other text-forms, which are output from the user interface in the form of a signal to one or more output devices.
93. The system according to claim 92 wherein predetermined information retrieval processes, accessed by said user interface, comprise: a) identification of linguistic entities in the text of documents and other text-forms; b) annotation of said identified linguistic entities in a text markup language to produce linguistically annotated documents and other text-forms; c) storage of said linguistically annotated documents and other text-forms; d) identification of concepts using linguistic information, where said concepts are represented in a concept specification language and said concepts to be identified occur in one of: 1) said text of documents and other text-forms in which linguistic entities have been identified in step a); 2) said linguistically annotated documents and other text-forms of step b); or 3) stored linguistically annotated documents and other text-forms of step c); e) annotation of said identified concepts in said text markup language to produce conceptually annotated documents and other text-forms; f) storage of said conceptually annotated documents and other text-forms; g) defining and learning concept representations of said concept specification language; h) checking user-defined descriptions of concepts represented in said concept specification language; and i) retrieval by matching said user-defined descriptions of concepts against said conceptually annotated documents and other text-forms.
94. The method according to claim 1 wherein said concepts, represented in said concept specification language, derive from the sublanguages used to analyze event-based specialized domains comprising insurance claims, business and financial reports, police incident reports, medical reports, and aviation incident reports.